Automatic backups of Azure blob storage?

I need to do an automatic periodic backup of an Azure blob storage to another Azure blob storage.
This is in order to guard against any kind of malfunction in the software.
Are there any services which do that? Azure doesn't seem to have this.

As @Brent mentioned in the comments to Roberto's answer, the replicas are for HA; if you delete a blob, that delete is replicated instantly.
For blobs, you can very easily create asynchronous copies to a separate blob (even in a separate storage account). You can also make snapshots which capture a blob at a current moment in time. At first, snapshots don't cost anything, but if you start modifying the blocks/pages referred to by the snapshot, then new blocks/pages are allocated. Over time, you'll want to start purging your snapshots. This is a great way to keep data "as-is" over time and revert back to a snapshot if there's a malfunction in your software.
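To make the "purge your snapshots over time" advice concrete, here is a minimal sketch of a retention rule in plain Python. The function name, the 30-day window and the "always keep the newest three" rule are my own assumptions for illustration, not anything Azure prescribes; you would feed it the snapshot timestamps you get from listing a blob's snapshots and then delete whatever it returns.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_purge(snapshot_times, now, keep_days=30, keep_latest=3):
    """Return the snapshot timestamps that are safe to delete.

    Hypothetical policy: keep every snapshot newer than `keep_days`, and
    always keep the `keep_latest` most recent snapshots regardless of age.
    """
    newest_first = sorted(snapshot_times, reverse=True)
    protected = set(newest_first[:keep_latest])
    cutoff = now - timedelta(days=keep_days)
    return [t for t in newest_first if t not in protected and t < cutoff]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    snapshots = [now - timedelta(days=d) for d in (1, 10, 40, 50, 60)]
    for stamp in snapshots_to_purge(snapshots, now):
        print("would delete snapshot taken at", stamp.isoformat())
```

Keeping a minimum number of snapshots regardless of age means a blob that rarely changes still has something to revert to.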
With queues, the malfunction story isn't quite the same, as typically you'd only have a small number of queue items present (at least that's the hope; if you have thousands of queue messages, this is typically a sign that your software is falling behind). In any event, when writing queue messages you could also write them to blob storage for archive purposes, in case there's a malfunction. I wouldn't recommend using blob-based messaging for scaling/parallel processing, since blobs don't have the mechanisms in place that queues do, but you could use the archived messages manually in case of malfunction.
There's no copy function for tables. You'd need to write to two tables during your write.
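A minimal sketch of that dual write, assuming both arguments behave like table clients exposing an `upsert_entity(entity)` method (the `azure-data-tables` `TableClient` has a method of that name, but treat the exact client as an assumption). The function name and the "return False when only the backup failed" convention are hypothetical.

```python
def write_to_both(primary_table, backup_table, entity):
    """Write `entity` to the primary table, then mirror it to the backup.

    A failure on the primary propagates to the caller. A failure on the
    backup is reported as False so the caller can log it or retry later.
    """
    primary_table.upsert_entity(entity)
    try:
        backup_table.upsert_entity(entity)
    except Exception:
        return False
    return True
```

Using an upsert rather than an insert makes a retried backup write harmless, since writing the same entity twice leaves the table in the same state.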

Azure keeps 3 redundant copies of your data in different locations in the same data centre where your data is hosted (to guard against hardware failure).
This applies to blob, table and queue storage.
Additionally, you can enable geo-replication on all of your storage. Azure will automatically keep redundant copies of your data in separate data centres. This guards against anything happening to the data centre itself.

Related

Azure ZRS/GRS vs snapshots

Why would I need to create a blob snapshot and incur additional cost if Azure already provides GRS(Geo redundant storage) or ZRS (Zone redundant storage)?
Redundancy (ZRS/GRS/RAGRS) provides a means to achieve high availability of your resources (blobs in your scenario). By enabling redundancy you ensure that a copy of your blob is available in another region/zone in case the primary region/zone is not available. It also protects against data corruption of the primary blob.
When you take a snapshot of your blob, a readonly copy of that blob in its current state is created and stored. If needed, you can restore a blob from a snapshot. This scenario is well suited if you want to store different versions of the same blob.
However, please keep in mind that neither redundancy nor a snapshot is a backup: if you delete the base blob, all the snapshots associated with that blob are deleted, and all the copies of that blob in other zones/regions are deleted as well.
I guess you need to understand the difference between Backup and Redundancy.
Backups make sure if something is lost, corrupted or stolen, that a copy of the data is available at your disposal.
Redundancy makes sure that if something fails (your computer fails, a drive gets fried, or a server freezes), you are still able to work regardless of the problem. Redundancy means that all your changes are replicated to another location. In case of a failover, your slave can theoretically function as a master and serve the (hopefully) latest state of your file system.
You could also turn soft delete on. That would keep a copy of every blob for every change made to it, even if someone deletes it. Then you set the retention period for those blobs so they would be automatically removed after some period of time.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-soft-delete

Azure Storage: Table vs Blob

Here is the problem. I have devices pushing telemetry messages to Azure IoT Hub, and currently I save all messages to Table Storage with the device ID as the partition key and the telemetry kind as the row key. What I want to do is restrict the size of the stored data. For instance, the table should keep only up to 50 MB and then should be cleared. What kind of storage should I use for such a use case, and what are the benefits? Any suggestions are highly appreciated.
Neither Azure Tables nor Azure Blobs have a feature where content is automatically deleted after a certain size is reached. In fact, I don't think I have come across any cloud storage solution that offers it (I have only seen data deleted automatically based on age).
Thus if you want to delete the data once it reaches a certain size, you will have to write some code and schedule it (using either Functions or WebJobs). That code will find the size occupied and delete the data going over the limit.
Between Blobs and Tables, I am somewhat conflicted. With Blobs, it is much easier to get the storage consumed: you just list the blobs in a container and sum up their sizes. With Tables, you need to keep fetching entities (i.e. download the data) and calculate the size of that data. On the other hand, deleting data from Tables is easier, as you are deleting rows (unless you store each record in a separate blob).
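For the blob variant, the "sum the sizes and delete what goes over the limit" step could look roughly like the sketch below. It is plain Python over a list of records; `BlobInfo` and `blobs_to_delete` are hypothetical names, and deleting oldest-first is my assumption (the question only says the data should be cleared). In a real job the records would come from listing the container, and the scheduled Function or WebJob would then delete the returned names.

```python
from collections import namedtuple

# Hypothetical record: just the three listing fields this sketch needs.
BlobInfo = namedtuple("BlobInfo", "name size last_modified")


def blobs_to_delete(blobs, max_bytes):
    """Pick the oldest blobs to delete so that the rest fits in max_bytes."""
    total = sum(b.size for b in blobs)
    doomed = []
    for blob in sorted(blobs, key=lambda b: b.last_modified):
        if total <= max_bytes:
            break
        doomed.append(blob.name)
        total -= blob.size
    return doomed


if __name__ == "__main__":
    MB = 1024 * 1024
    listing = [
        BlobInfo("telemetry-001.json", 20 * MB, 1),
        BlobInfo("telemetry-002.json", 20 * MB, 2),
        BlobInfo("telemetry-003.json", 20 * MB, 3),
    ]
    print(blobs_to_delete(listing, 50 * MB))
```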
If the limit were based on data age rather than data size, I would have recommended Cosmos DB. Though it is more expensive than Azure Storage, you can define a TTL at the collection level, and documents are then deleted automatically according to that policy.

Are Azure Blob copy operations cheap?

Azure Blob Storage does not expose any kind of "blob rename" operation - which sounds preposterous because the idea of renaming an entity is a fundamental operation in almost any storage system - and Azure's documentation makes no reference to how a blob's name is used internally (e.g. as DHT key), but as we can specify our own names it's clear that Azure isn't using a content-addressable storage model (so renaming should be possible, once the Azure Storage team decides to allow it).
Microsoft advocates instead that to "rename" a blob, you simply copy it, then delete the original - which seems incredibly inefficient - for example, if you have a 200GB video file blob with a typo in the blob name - unless internally Azure has some kind of dedupe system - in which case it makes perfect sense to eliminate the special-case of "blob renaming" because internally it really would be a "name copy" operation.
Unfortunately the current documentation for blob copy ( https://learn.microsoft.com/en-us/rest/api/storageservices/fileservices/copy-blob ) does not describe any internal processes, and in fact suggests that the blob copy might be a very long operation:
State of the copy operation, with these values:
success: the copy completed successfully.
pending: the copy is in progress.
If it was using a dedupe system internally then all blob copy operations would be instantaneous so there would be no need for an "in progress" status; also confusingly it uses "pending" to refer to "in progress" - when normally "pending" means "enqueued, not starting yet".
Alarmingly, the documentation also states this:
A copy attempt that has not completed after 2 weeks times out and leaves an empty blob
...which can be taken to read that there are zero guarantees about the time it takes to copy a blob. There is nothing in the page to suggest smaller blobs are copied quicker compared to bigger blobs - so for some reason (such as a long queue, unfortunate outages, and so on) it could take 2 weeks to correct my hypothetical typo in my hypothetical 200GB video file - and don't forget that I cannot delete my original misnamed blob until the copy operation is completed - which means needing to design my client software to constantly check and eventually issue the delete operation (and to ensure my software runs continuously for up to 2 weeks...).
Is there any authoritative information regarding the runtime characteristics and nature of Azure Blob copy operations?
As you may already know, Copy Blob is an asynchronous operation, and everything you mentioned above is true, with one caveat: the copy is effectively synchronous when the source and destination are in the same storage account. You get the same copy states whether you're copying across storage accounts or within one, but within the same storage account the operation completes almost instantaneously.
So when you rename a blob, you're creating a copy of the blob in the same storage account (even the same container), which is instantaneous. I am not 100% sure about the internal implementation, but if I am not mistaken, when you copy a blob within the same storage account it doesn't copy the bytes to some separate place. It just creates 2 pointers (the new blob and the old blob) pointing to the same stored data. Only once you start making changes to the blobs, I think, does it go and change those bytes.
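The copy-then-delete "rename" can be sketched as below. This assumes a container client shaped like the one in the `azure-storage-blob` v12 Python SDK (`get_blob_client`, `start_copy_from_url`, `get_blob_properties().copy.status`, `delete_blob`); the function name, the polling interval and the timeout are hypothetical. The loop is there because the service reports `pending` until the copy finishes, even though within one storage account it normally reports `success` right away.

```python
import time


def rename_blob(container, old_name, new_name,
                poll_seconds=1.0, timeout_seconds=3600):
    """'Rename' a blob by copying it, waiting for the copy, then deleting.

    `container` is assumed to behave like an azure-storage-blob
    ContainerClient; nothing here is specific to a real account.
    """
    source = container.get_blob_client(old_name)
    target = container.get_blob_client(new_name)
    target.start_copy_from_url(source.url)

    deadline = time.monotonic() + timeout_seconds
    while True:
        status = target.get_blob_properties().copy.status
        if status == "success":
            break
        if status != "pending":
            raise RuntimeError(f"copy of {old_name!r} ended as {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"copy of {old_name!r} is still pending")
        time.sleep(poll_seconds)

    # Only remove the original once the copy is confirmed complete.
    source.delete_blob()
```

Deleting the source only after a confirmed `success` means an aborted or failed copy never costs you the original blob.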
For internal understanding of Azure Storage, I would highly recommend that you read the paper published by the team a few years ago. Please look at my answer here which has links to this paper: Azure storage underlying technology.

How to clone blob container and contents

Summary
I have picked up support for a fairly old website which stores a bunch of blobs in Azure. What I would like to do is duplicate all of my blobs from live to the test environment so I can use them without affecting users.
Architecture
The website is a mix of VB webforms and MVC, communicating with an Azure blob service (e.g. https://x.blob.core.windows.net/LiveBlobs).
The test site mirrors the live setup, except it points to a different blob container in the same storage account (e.g. https://x.blob.core.windows.net/TestBlobs)
Questions
Can I copy all of the blobs from live to test without downloading them? They would need to maintain the same names.
How do I work out what it will cost to do this? The live blob storage is roughly 130GB, but it should just be copying the data within the same data centre, right?
Things I've investigated
I've spent quite some time searching for an answer, but what I've found deals with copying between storage accounts or copying single blobs.
I've also found AzCopy which looks promising but it looks like it would copy the files one by one so I'm worried it would end up taking a long time and costing a lot.
I am fairly new to Azure so please forgive me if this is a silly question or I've missed out some important details. I'm more than happy to add any extra information should you need it.
Can I copy all of the blobs from live to test without downloading them? They would need to maintain the same names.
Yes, you can. Copying blob is an asynchronous server-side operation. You simply tell the blob service the blobs to copy & destination details and it will do the job for you. No need to download first and upload them to destination.
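As a rough sketch of what "tell the blob service the blobs to copy" means in code, the function below walks the source container and starts one server-side copy per blob, keeping the names unchanged. It assumes both arguments behave like `azure-storage-blob` v12 `ContainerClient` objects (`list_blobs`, `get_blob_client`, and `start_copy_from_url` on the blob client); `clone_container` itself is a hypothetical name. If the source were a private container in a different storage account, the source URL would additionally need a SAS token, which this sketch does not handle.

```python
def clone_container(source, destination):
    """Start a server-side copy of every blob in `source` into `destination`.

    Returns the names of the blobs for which a copy was started. No blob
    content passes through the machine running this code.
    """
    started = []
    for blob in source.list_blobs():
        source_url = source.get_blob_client(blob.name).url
        destination.get_blob_client(blob.name).start_copy_from_url(source_url)
        started.append(blob.name)
    return started
```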
How do I work out what it will cost to do this? The live blob storage is roughly 130GB, but it should just be copying the data within the same data centre right?
So there are 3 things you need to consider when it comes to costing: 1) Storage costs, 2) transaction costs and 3) data egress costs.
Since the copied blobs will be stored somewhere, they will be consuming storage and you will incur storage costs.
The copy operation performs some read operations on the source blobs and then write operations on the destination blobs (to create them), so you will incur transaction costs. At the very minimum, you can expect 2 transactions for each blob copy: a read on the source and a write on the destination (though there can be more transactions).
You incur data egress costs if the destination storage account is not in the same region as your source storage account. As long as both storage accounts are in the same region, you would not incur this.
You can use Azure Storage Pricing Calculator to get an idea about how much it is going to cost you.
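To see how the three components add up, here is a back-of-the-envelope sketch. Every price is a parameter you must fill in from the pricing calculator; the function name and the default of 2 transactions per blob (the minimum mentioned above) are assumptions, and the blob count in the usage example is made up, since the question only gives the total size.

```python
def estimate_copy_cost(blob_count, total_gb, price_per_gb_month,
                       price_per_10k_transactions, egress_price_per_gb=0.0,
                       same_region=True, transactions_per_blob=2):
    """Rough cost breakdown for copying blobs; all prices are inputs."""
    storage = total_gb * price_per_gb_month
    transactions = (blob_count * transactions_per_blob / 10_000
                    * price_per_10k_transactions)
    # Egress only applies when the destination is in another region.
    egress = 0.0 if same_region else total_gb * egress_price_per_gb
    return {
        "storage_per_month": storage,
        "transactions": transactions,
        "egress": egress,
    }


if __name__ == "__main__":
    # Placeholder prices and blob count, NOT real Azure rates.
    print(estimate_copy_cost(blob_count=50_000, total_gb=130,
                             price_per_gb_month=0.02,
                             price_per_10k_transactions=0.05))
```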
I've also found AzCopy which looks promising but it looks like it would copy the files one by one so I'm worried it would end up taking a long time and costing a lot.
Blobs are always copied one by one. Copying across storage accounts is always an asynchronous server-side operation, so you can't really predict how long the copy will take, but in my experience it is quite fast. If you want to control when the blobs are copied, you would need to download them first and then upload them. AzCopy supports this mode as well.
As far as costs are concerned, I think it is a relative term when you say it is going to cost a lot. But in general Azure Storage is very cheap and 130 GB is not a whole lot of data.

best design solution to migrate data from SQL Azure to Azure Table

In our service, we are using SQL Azure as the main storage, and Azure table for the backup storage. Everyday about 30GB data is collected and stored to SQL Azure. Since the data is no longer valid from the next day, we want to migrate the data from SQL Azure to Azure table every night.
The question is: what would be the most efficient way to migrate data from SQL Azure to Azure Table?
The naive idea I came up with is to leverage the producer/consumer concept by using IDataReader. That is, first get a data reader by executing "select * from TABLE" and put the data into a queue. At the same time, a set of threads grab data from the queue and insert it into Azure Table.
Of course, the main disadvantage of this approach (I think) is that we need to keep the connection open for a long time (possibly several hours).
Another approach is to first copy data from SQL Azure table to local storage on Windows Azure, and use the same producer/consumer concept. In this approach we can disconnect the connection as soon as the copy is done.
At this point, I'm not sure which one is better, or even whether either of them is a good design to implement. Could you suggest a good design solution for this problem?
Thanks!
I would not recommend using local storage, primarily because:
It is transient storage.
You're limited by the size of local storage (which in turn depends on the size of the VM).
Local storage is local only i.e. it is accessible only to the VM in which it is created thus preventing you from scaling out your solution.
I like the idea of using queues, however I see some issues there as well:
Assuming you're planning on storing each row in a queue as a message, you would be performing a lot of storage transactions. If we assume that your row size is 64KB, to store 30 GB of data you would be doing about 500000 write transactions (and similarly 500000 read transactions) - I hope I got my math right :). Even though the storage transactions are cheap, I still think you'll be doing a lot of transactions which would slow down the entire process.
Since you're doing so many transactions, you may get hit by storage thresholds. You may want to check into that.
Yet another limitation is the maximum size of a message. Currently a maximum of 64KB of data can be stored in a single message. What would happen if your row size is more than that?
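The transaction estimate above can be checked with a few lines of arithmetic, assuming binary units (1 GB = 1024^3 bytes, 1 KB = 1024 bytes) and one 64KB row per queue message:

```python
GIB = 1024 ** 3
KIB = 1024

daily_bytes = 30 * GIB      # data collected per day
message_bytes = 64 * KIB    # assumed row size = maximum queue message size

messages = daily_bytes // message_bytes
print(messages)        # 491520 writes, i.e. roughly 500,000
print(messages * 2)    # 983040 once the matching reads are included
```

So the "about 500000" figure holds, and smaller rows would push the count higher still.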
I would actually recommend throwing blob storage in the mix. What you could do is read a chunk of data from SQL table (say 10000 or 100000 records) and save that data in blob storage as a file. Depending on how you want to put the data in table storage, you could store the data in CSV, JSON or XML format (XML format for preserving data types if it is needed). Once the file is written in blob storage, you could write a message in the queue. The message will contain the URI of the blob you've just written. Your worker role (processor) will continuously poll this queue, get one message, fetch the file from blob storage and process that file. Once the worker role has processed the file, you could simply delete that file and the message.
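The flow described above can be sketched end to end with in-memory stand-ins: a plain `dict` plays the role of blob storage and a `queue.Queue` plays the role of the storage queue. Both stand-ins, the function names and the CSV layout are assumptions for illustration only; in a real worker role they would be replaced by actual blob and queue clients, and the rows would come from the SQL data reader.

```python
import csv
import io
import queue
from itertools import islice


def export_in_chunks(rows, header, chunk_size, blob_store, work_queue,
                     prefix="export"):
    """Producer: write rows to 'blobs' in CSV chunks, enqueue each blob name."""
    iterator = iter(rows)
    index = 0
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(header)
        writer.writerows(chunk)
        name = f"{prefix}/chunk-{index:06d}.csv"
        blob_store[name] = buffer.getvalue()   # stand-in for a blob upload
        work_queue.put(name)                   # message carries only the name
        index += 1
    return index


def process_queue(blob_store, work_queue, handle_row):
    """Consumer: take one message at a time, process its file, delete it."""
    processed = 0
    while True:
        try:
            name = work_queue.get_nowait()
        except queue.Empty:
            break
        for row in csv.DictReader(io.StringIO(blob_store[name])):
            handle_row(row)   # e.g. insert the row into table storage
            processed += 1
        del blob_store[name]  # file is removed once fully processed
    return processed
```

Because each message holds only a blob name, the 64KB message limit no longer constrains the row size, and the number of queue transactions drops from one per row to one per chunk.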
