Automatically Delete/Expire Azure Blobs after a time period - azure

With Azure Blob storage is it possible to either have an individual blob or all blobs within a container delete themselves after a certain period of time similar to Amazon AWS S3's Object Expiration Feature? Or does Azure storage not provide such functionality?

It is possible with Azure Blob storage lifecycle management. Please take a look here:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-lifecycle-management-concepts?tabs=azure-portal
and at "Configure a lifecycle management policy".

Because I had missed this feature for years, I wrote a small project with a nice 'Deploy to Azure' button. It's not yet perfect, but it works: https://github.com/nulllogicone/ExpireBlobFunction
And now I see that Microsoft released this as a feature on March 27, 2019.
Excerpt from that article:
Azure Blob storage lifecycle management offers a rich, rule-based
policy for GPv2 and Blob storage accounts. Use the policy to
transition your data to the appropriate access tiers or expire at the
end of the data's lifecycle.
The lifecycle management policy lets you:
Transition blobs to a cooler storage tier (hot to cool, hot to archive, or cool to archive) to optimize for performance and cost
Delete blobs at the end of their lifecycles
Define rules to be run once per day at the storage account level
Apply rules to containers or a subset of blobs (using prefixes as filters)

You can auto-delete in various ways. For a long time you have been able to do it with a Logic App, but that is not always straightforward.
Today it is available directly in your storage account, where there is an easy, specific task generator (template) to delete old blobs.

Looks like the feature is planned now at least: https://feedback.azure.com/forums/217298-storage/suggestions/2474308-provide-time-to-live-feature-for-blobs

The Azure Storage Team recently posted (Oct 5, 2017) an update on expiring blobs. It seems that this is now possible using an Azure Logic App template and they will have a native blob storage solution later this year.
Link: Provide Time to live feature for Blobs
We are pleased to announce that we have made an Azure Logic Apps template available to expire old blobs. To set up this automated solution in your environment: Create a new Logic Apps instance, select the “Delete old Azure blobs” template, customize and run. We will release a blog post detailing instructions and providing more templates in the coming weeks.
Allowing users to define expiration policies on blobs natively from storage is still planned for the coming year. As soon as we have progress to share, we will do so. We will continue to provide updates at least once per quarter.
For any further questions, or to discuss your specific scenario, send us an email at azurestoragefeedback@microsoft.com.

Azure Storage does not have an expiration feature; you must delete blobs via your app. How you do this is up to you; you'll need to store your expiration date target somewhere (whether in a database or in blob properties).
You can effectively create TTL on blob access, via Shared Access Signatures (by setting an end-date on the SAS). This would let you have an effective way of removing access when it's time to remove access, and then have a follow-on process remove the now-expired blobs.
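For illustration, here is a minimal sketch (not from the original answer) of issuing a time-limited SAS with the current azure-storage-blob Python package; the account, container, and blob names are placeholders:

    # Minimal sketch: a blob SAS that stops working after 7 days.
    # Account name, key, container and blob names are placeholders.
    from datetime import datetime, timedelta, timezone
    from azure.storage.blob import generate_blob_sas, BlobSasPermissions

    sas = generate_blob_sas(
        account_name="mystorageaccount",            # placeholder
        container_name="mycontainer",
        blob_name="report.pdf",
        account_key="<account-key>",
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(days=7),  # access ends here
    )
    url = f"https://mystorageaccount.blob.core.windows.net/mycontainer/report.pdf?{sas}"

Once the expiry passes, the URL stops working, and a follow-on cleanup job can delete the now-expired blob itself.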

Yes, it's possible. Refer to these two links; it was hard to find the sample code.
Rules reference: https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview?tabs=azure-portal
Python sample code reference: https://github.com/Azure-Samples/azure-samples-python-management/blob/master/samples/storage/manage_management_policy.py
Code snippet I used:
    # Requires the azure-identity and azure-mgmt-storage packages.
    from azure.identity import ClientSecretCredential
    from azure.mgmt.storage import StorageManagementClient

    def add_expiry_rule(self):
        # tenant_id, client_id, client_secret, subscription_id, group_name and
        # storage_account are expected to be defined elsewhere (e.g. configuration).
        token_credential = ClientSecretCredential(
            tenant_id=tenant_id,
            client_id=client_id,
            client_secret=client_secret,
        )
        storage_client = StorageManagementClient(
            credential=token_credential, subscription_id=subscription_id
        )
        rule = {
            "id": "test",
            "prefix": "test/",
            "expiration": 91,
        }
        azure_rule = {
            "enabled": True,
            "name": rule.get("id"),
            "type": "Lifecycle",
            "definition": {
                "filters": {"blob_types": ["blockBlob"], "prefix_match": [rule.get("prefix")]},
                "actions": {
                    "base_blob": {
                        "delete": {
                            "days_after_modification_greater_than": str(rule.get("expiration"))
                        }
                    }
                },
            },
        }
        try:
            # Append the new rule to the existing "default" management policy, if any.
            management_policy = storage_client.management_policies.get(
                group_name, storage_account, "default"
            )
            existing_rules = management_policy.policy.as_dict()
            existing_rules.get("rules").append(azure_rule)
            management_policy_rules = existing_rules
        except Exception:
            # No policy exists yet, so start a new one containing just this rule.
            management_policy_rules = {"rules": [azure_rule]}
        try:
            management_policy = storage_client.management_policies.create_or_update(
                group_name,
                storage_account,
                "default",
                {"policy": management_policy_rules},
            )
            print("Azure: Added rule {} successfully".format(rule.get("id")))
        except Exception as e:
            if e.message.endswith("conflicting rule name."):
                print("Azure: Rule ID: {} exists".format(rule.get("id")))
            else:
                raise Exception("Azure: Error adding rule. {}".format(e.message))

Related

Azure Logic App with Azure Blob Storage Action: Getting 429 statusCode error

I am using Azure Logic App with Azure BLOB Storage trigger.
When a blob is created or modified in Azure Storage, I pull the content of that blob from Storage, do some transformations on the data, and push it back to Azure Storage as new blob content using the Create Content - Azure Blob Storage action of the Logic App.
With a large number of blobs inserted (for example 10,000 files) or updated in blob storage, the Logic App gets triggered for multiple runs as expected, but the subsequent Azure Blob actions fail with the following error:
{
  "statusCode": 429,
  "message": "Rate limit is exceeded. Try again in 16 seconds."
}
Has anyone faced a similar issue in Logic Apps? If yes, can you suggest the possible reason and a probable fix?
Thanks
Seems like you are hitting the rate limits on the Azure Blob Managed API.
Please refer to Jörgen Bergström's blog about this: http://techstuff.bergstrom.nu/429-rate-limit-exceeded-in-logic-apps/
Essentially he says you can set up multiple API connections that do the same thing and then randomize, in the Logic App code view, which of those connections is used, which eliminates the rate-limit issue.
As an example (I was using SQL connectors), see below the API connections I set up for my logic app. You can do the same with a blob storage connection and use a similar naming convention, e.g. blob_1, blob_2, blob_3, ... and so on. You can create as many as you would like; I created 10 for mine:
You would then, in your logic app code view, replace all your current blob connections, e.g.
@parameters('$connections')['blob']['connectionId']
where "blob" is your current blob API connection, with the following:
@parameters('$connections')[concat('blob_',rand(1,10))]['connectionId']
And then make sure to add all your "blob_" connections at the end of your code:
"blob_1": {
"connectionId": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Web/connections/blob-1",
"connectionName": "blob-1",
"id": "/subscriptions/.../providers/Microsoft.Web/locations/.../managedApis/blob"
},
"blob_2": {
"connectionId": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Web/connections/blob-2",
"connectionName": "blob-2",
"id": "/subscriptions/.../providers/Microsoft.Web/locations/.../managedApis/blob"
},
...
The logic app would then randomize which connection to use during the run eliminating the 429 rate limit error.
Please check this doc: https://learn.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-request-limits
For each Azure subscription and tenant, Resource Manager allows up to
12,000 read requests per hour and 1,200 write requests per hour.
You can check the remaining quota by reading the response headers:
response.Headers.GetValues("x-ms-ratelimit-remaining-subscription-reads").GetValue(0)
or
response.Headers.GetValues("x-ms-ratelimit-remaining-subscription-writes").GetValue(0)

Retention Policy for Azure Containers?

I'm looking to set up a policy for one of my containers so it deletes or only retains data for x days. So if x is 30, that container should only contain files that are less than 30 days old. If the files are sitting in the container for more than 30 days it should discard them. Is there any way I can configure that?
Currently, this kind of thing is not supported by Azure Blob Storage. You would need to write something of your own that runs periodically to do this check and delete old blobs.
On a side note, this feature has been long pending (since 2011): https://feedback.azure.com/forums/217298-storage/suggestions/2474308-provide-time-to-live-feature-for-blobs.
UPDATE
If you need to do it yourself, there are two things to consider:
Code to fetch the list of blobs, find out which blobs need to be deleted, and then delete those blobs. To do this, you can use the Azure Storage SDK. The Azure Storage SDK is available for many programming languages like .NET, Java, Node, PHP, etc. Just use the one you're comfortable with (a rough sketch follows after this list).
Schedule this code to run once daily: To do this, you can use one of the many services available in Azure, such as Azure WebJobs, Functions, Scheduler, Azure Automation, etc.
If you decide to use Azure Automation, there's a Runbook already available for you that you can use (no need to write your code). You can find more details about this here: https://gallery.technet.microsoft.com/scriptcenter/Remove-Storage-Blobs-that-aae4b761.
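As a rough sketch of point 1 above (assuming the current azure-storage-blob Python package; connection string and container name are placeholders), the cleanup code could look like this:

    # Rough sketch: delete blobs older than 30 days from one container.
    from datetime import datetime, timedelta, timezone
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_connection_string(
        "<connection-string>", container_name="mycontainer")

    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    for blob in container.list_blobs():
        if blob.last_modified < cutoff:        # last_modified is a UTC datetime
            container.delete_blob(blob.name)

You would then schedule it per point 2, for example with an Azure Functions timer trigger or a WebJob.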
Azure Blob storage lifecycle management (Preview) is now available, and with it you can create policies with different rules.
Here is the rule to delete blobs that are older than 30 days:
{
  "version": "0.5",
  "rules": [
    {
      "name": "expirationRule",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": [ "blockBlob" ]
        },
        "actions": {
          "baseBlob": {
            "delete": { "daysAfterModificationGreaterThan": 30 }
          }
        }
      }
    }
  ]
}
For more details, refer to Azure Blob storage lifecycle.

Why can't I configure Azure diagnostics to use Azure Table Storage via the new Azure Portal?

I am developing a web api which will be hosted in Azure. I would like to use Azure diagnostics to log errors to Azure table storage.
In the Classic portal, I can configure the logs to go to Azure table storage.
Classic Portal Diagnostic Settings
However in the new Azure portal, the only storage option I have is to use Blob storage:
New Azure Portal Settings
It seems that if I were to make use of a web role, I could configure the data store for diagnostics, but as I am developing a web API, I don't want to create a separate web role for every API just so that I can log to an Azure table.
Is there a way to programmatically configure azure diagnostics to propagate log messages to a specific data store without using a web role? Is there any reason why the new Azure portal only has diagnostic settings for blob storage and not table storage?
I can currently work around the problem by using the classic portal but I am worried that table storage for diagnostics will eventually become deprecated since it hasn't been included in the diagnostic settings for the new portal.
(I'll do some necromancy on this question as this was the most relevant StackOverflow question I found while searching for a solution to this as it is no longer possible to do this through the classic portal)
Disclaimer: Microsoft has seemingly removed support for logging to Table in the Azure Portal, so I don't know if this is deprecated or will soon be deprecated, but I have a solution that will work now (31.03.2017):
There are specific settings determining logging, I first found out information on this from an issue in the Azure Powershell github: https://github.com/Azure/azure-powershell/issues/317
The specific settings we need are (from github):
AzureTableTraceEnabled = True, & AppSettings has:
DIAGNOSTICS_AZURETABLESASURL
Using the excellent resource explorer (https://resources.azure.com) under (GUI navigation):
/subscriptions/{subscriptionName}/resourceGroups/{resourceGroupName}/providers/Microsoft.Web/sites/{siteName}/config/logs
I was able to find the Setting AzureTableTraceEnabled in the Properties.
The property AzureTableTraceEnabled has Level and sasURL. In my experience updating these two values (Level="Verbose",sasUrl="someSASurl") will work, as updating the sasURL sets DIAGNOSTICS_AZURETABLESASURL in appsettings.
How do we change this? I did it in PowerShell. I first tried the cmdlet Get-AzureRmWebApp, but could not find what I wanted - the old Get-AzureWebSite does display AzureTableTraceEnabled, but I could not get it to update (perhaps someone with more PowerShell/Azure experience can offer input on how to use the ASM cmdlets to do this).
The solution that worked for me was setting the property through the Set-AzureRmResource command, with the following settings:
Set-AzureRmResource -PropertyObject $PropertiesObject -ResourceGroupName "$ResourceGroupName" -ResourceType Microsoft.Web/sites/config -ResourceName "$ResourceName/logs" -ApiVersion 2015-08-01 -Force
Where the $PropertiesObject looks like this:
$PropertiesObject = @{applicationLogs=@{azureTableStorage=@{level="$Level";sasUrl="$SASUrl"}}}
The Level corresponds to "Error", "Warning", "Information", "Verbose" and "Off".
It is also possible to do this in the ARM template (the important bits are in the properties of the logs resource in the site):
{
  "apiVersion": "2015-08-01",
  "name": "[variables('webSiteName')]",
  "type": "Microsoft.Web/sites",
  "location": "[resourceGroup().location]",
  "tags": {
    "displayName": "WebApp"
  },
  "dependsOn": [
    "[resourceId('Microsoft.Web/serverfarms/', variables('hostingPlanName'))]"
  ],
  "properties": {
    "name": "[variables('webSiteName')]",
    "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', variables('hostingPlanName'))]"
  },
  "resources": [
    {
      "name": "logs",
      "type": "config",
      "apiVersion": "2015-08-01",
      "dependsOn": [
        "[resourceId('Microsoft.Web/sites/', variables('webSiteName'))]"
      ],
      "tags": {
        "displayName": "LogSettings"
      },
      "properties": {
        "azureTableStorage": {
          "level": "Verbose",
          "sasUrl": "SASURL"
        }
      }
    }
  ]
}
The issue with doing it in ARM is that I've yet to find a way to generate the correct SAS. It is possible to fetch the Azure Storage account keys (from: ARM - How can I get the access key from a storage account to use in AppSettings later in the template?):
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=', variables('storageAccountName'),';AccountKey=',listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('storageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value)]"
}
}
There are also some clever ways of generating them using linked templates (from: http://wp.sjkp.dk/service-bus-arm-templates/).
The current solution I went for (time constraints) was a custom Powershell script that looks something like this:
...
$SASUrl = New-AzureStorageTableSASToken -Name $LogTable -Permission $Permissions -Context $StorageContext -StartTime $StartTime -ExpiryTime $ExpiryTime -FullUri
$PropertiesObject = @{applicationLogs=@{azureTableStorage=@{level="$Level";sasUrl="$SASUrl"}}}
Set-AzureRmResource -PropertyObject $PropertiesObject -ResourceGroupName "$ResourceGroupName" -ResourceType Microsoft.Web/sites/config -ResourceName "$ResourceName/logs" -ApiVersion 2015-08-01 -Force
...
This is quite an ugly solution, as it is something extra you need to maintain in addition to the ARM template - but it is easy, fast and it works while we wait for updates to the ARM Templates (or for someone cleverer than I to come and enlighten us).
We don't typically recommend using Tables for log data - it can result in the append-only pattern, which at scale doesn't work effectively for Table storage. See the log-data anti-pattern in the Table Design Guide. Oftentimes we see that even though people think of log data as structured, the way they typically query it makes Blobs more efficient.
Excerpt from the Design Guide:
Log data anti-pattern
Typically, you should use the Blob service instead of the Table service to store log data.
Context and problem
A common use case for log data is to retrieve a selection of log entries for a specific date/time range: for example, you want to find all the error and critical messages that your application logged between 15:04 and 15:06 on a specific date. You do not want to use the date and time of the log message to determine the partition you save log entities to: that results in a hot partition because at any given time, all the log entities will share the same PartitionKey value (see the section Prepend/append anti-pattern).
...
Solution
The previous section highlighted the problem of trying to use the Table service to store log entries and suggested two, unsatisfactory, designs. One solution led to a hot partition with the risk of poor performance writing log messages; the other solution resulted in poor query performance because of the requirement to scan every partition in the table to retrieve log messages for a specific time span. Blob storage offers a better solution for this type of scenario and this is how Azure Storage Analytics stores the log data it collects.
This section outlines how Storage Analytics stores log data in blob storage as an illustration of this approach to storing data that you typically query by range.
Storage Analytics stores log messages in a delimited format in multiple blobs. The delimited format makes it easy for a client application to parse the data in the log message.
Storage Analytics uses a naming convention for blobs that enables you to locate the blob (or blobs) that contain the log messages for which you are searching. For example, a blob named "queue/2014/07/31/1800/000001.log" contains log messages that relate to the queue service for the hour starting at 18:00 on 31 July 2014. The "000001" indicates that this is the first log file for this period. Storage Analytics also records the timestamps of the first and last log messages stored in the file as part of the blob’s metadata. The API for blob storage enables you to locate blobs in a container based on a name prefix: to locate all the blobs that contain queue log data for the hour starting at 18:00, you can use the prefix "queue/2014/07/31/1800."
Storage Analytics buffers log messages internally and then periodically updates the appropriate blob or creates a new one with the latest batch of log entries. This reduces the number of writes it must perform to the blob service.
If you are implementing a similar solution in your own application, you must consider how to manage the trade-off between reliability (writing every log entry to blob storage as it happens) and cost and scalability (buffering updates in your application and writing them to blob storage in batches).
Issues and considerations
Consider the following points when deciding how to store log data:
If you create a table design that avoids potential hot partitions, you may find that you cannot access your log data efficiently.
To process log data, a client often needs to load many records.
Although log data is often structured, blob storage may be a better solution.
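To illustrate the prefix-based lookup described above, here is a small sketch (my own, assuming the azure-storage-blob Python package; the connection string is a placeholder) that lists the queue-service log blobs for the hour starting at 18:00 on 31 July 2014:

    # Sketch: locate Storage Analytics log blobs by name prefix.
    from azure.storage.blob import ContainerClient

    logs = ContainerClient.from_connection_string(
        "<connection-string>", container_name="$logs")   # analytics logs container

    # All queue-service log blobs for the hour starting 18:00 on 31 July 2014
    for blob in logs.list_blobs(name_starts_with="queue/2014/07/31/1800"):
        print(blob.name)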

What is the best way to backup Azure Blob Storage contents

I know that the Azure Storage entities (blobs, tables and queues) have a built-in resiliency, meaning that they are replicated to 3 different servers in the same datacenter. On top of that they may also be replicated to a different datacenter altogether that is physically located in a different geographical region. The chance of losing your data in this case is close to zero for all practical purposes.
However, what happens if a sloppy developer (or one under the influence of alcohol :)) accidentally deletes the storage account through the Azure Portal or the Azure Storage Explorer tool? Worse yet, what if a hacker gets hold of your account and clears the storage? Is there a way to retrieve the gigabytes of deleted blobs, or is that it? Somehow I think there has to be an elegant solution that the Azure infrastructure provides here, but I cannot find any documentation.
The only solution I can think of is to write my own process (worker role) that periodically backs up my entire storage to a different subscription/account, thus essentially doubling the cost of storage and transactions.
Any thoughts?
Regards,
Archil
Depending on where you want to backup your data, there are two options available:
Backing up data locally - If you wish to backup your data locally in your infrastructure, you could:
a. Write your own application using either Storage Client Library or consuming REST API or
b. Use 3rd party tools like Cerebrata Azure Management Cmdlets (Disclosure: I work for Cerebrata).
Backing up data in the cloud - Recently, Windows Azure Storage team announced Asynchronous Copy Blob functionality which will essentially allow you to copy data from one storage account to another storage account without downloading the data locally. The catch here is that your target storage account should be created after 7th June 2012. You can read more about this functionality on Windows Azure Blog: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/06/12/introducing-asynchronous-cross-account-copy-blob.aspx.
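As a rough modern equivalent of that asynchronous copy (a sketch assuming today's azure-storage-blob Python package rather than the 2012-era REST API; URLs and connection strings are placeholders):

    # Sketch: server-side, cross-account copy - no local download needed.
    from azure.storage.blob import BlobClient

    # The source blob must be readable by the destination service (public or SAS URL)
    source_url = "https://sourceaccount.blob.core.windows.net/container/file.bin?<sas>"

    dest = BlobClient.from_connection_string(
        "<destination-connection-string>", container_name="backup", blob_name="file.bin")

    copy = dest.start_copy_from_url(source_url)   # asynchronous, server-side copy
    print(copy["copy_status"])                    # e.g. 'pending' or 'success'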
Hope this helps.
The accepted answer is fine, but it took me a few hours to decipher everything.
I've put together a solution which I now use in production. I expose a Backup() method through a Web API, which is then called by an Azure WebJob every day (at midnight).
Note that I've taken the original source code and modified it:
it wasn't up to date, so I changed a few method names
added a retry-copy-operation safeguard (fails after 4 tries for the same blob)
added a little bit of logging - you should swap it out with your own
does the backup between two storage accounts (replicating containers & blobs)
added purging - it gets rid of old containers that are not needed (keeps 16 days' worth of data); you can always disable this, as space is cheap
The source can be found at: https://github.com/ChrisEelmaa/StackOverflow/blob/master/AzureStorageAccountBackup.cs
And this is how I use it in the controller (note: your controller should only be callable by the Azure WebJob - you can check credentials in the headers):
[Route("backup")]
[HttpPost]
public async Task<IHttpActionResult> Backup()
{
try
{
await _blobService.Backup();
return Ok();
}
catch (Exception e)
{
_loggerService.Error("Failed to backup blobs " + e);
return InternalServerError(new Exception("Failed to back up blobs!"));
}
}
Note: I wanted to add that code as part of the post, but wasted 6 minutes trying to get it into the post and failed; the formatting didn't work at all and it broke completely.
I have used Azure Data Factory to back up Azure Storage with great effect. It's really easy to use, cost-effective, and works very well.
Simply create a Data Factory (v2), set up data connections to your data sources (it currently supports Azure Tables, Azure Blobs and Azure Files) and then set up a data copy pipeline.
The pipelines can merge, overwrite, etc. and you can set up custom rules/wildcards.
Once you've set up the pipeline, you should then set up a schedule trigger. This will kick off the backup at an interval to suit your needs.
I've been using it for months and it's perfect. No code, no VMs, no custom PowerShell scripts or third-party software. A pure Azure solution.
I have had exactly the same requirement: backing up blobs from Azure, as we have millions of customer blobs and you are right - a sloppy developer with full access can compromise the entire system.
Thus, I wrote an entire application "Blob To Local Backup", free and open source on github under the MIT license: https://github.com/smartinmedia/BlobToLocalBackup
It solves many of your issues, namely:
a) you can give only READ-access to this application, so that the application cannot destroy any data on Azure
b) backup to a server, where your sloppy developer or the hacker does not have the same access as to your Azure account.
c) The software provides versioning, so you can even protect yourself from, e.g., ransom/encryption attacks.
d) I included a serialization method instead of a database, so you can have millions of files on Azure and still keep them in sync (we have 20 million files on Azure).
Here is how it works (for more detailed information, read the README on github):
You set up the appsettings.json file in the main folder. You can provide the LoginCredentials here for all access, or do it more granularly at the storage account level:
{
  "App": {
    "ConsoleWidth": 150,
    "ConsoleHeight": 42,
    "LoginCredentials": {
      "ClientId": "2ab11a63-2e93-2ea3-abba-aa33714a36aa",
      "ClientSecret": "ABCe3dabb7247aDUALIPAa-anc.aacx.4",
      "TenantId": "d666aacc-1234-1234-aaaa-1234abcdef38"
    },
    "DataBase": {
      "PathToDatabases": "D:/temp/azurebackup"
    },
    "General": {
      "PathToLogFiles": "D:/temp/azurebackup"
    }
  }
}
Set up a job as a JSON file like this (I have added numerous options):
{
  "Job": {
    "Name": "Job1",
    "DestinationFolder": "D:/temp/azurebackup",
    "ResumeOnRestartedJob": true,
    "NumberOfRetries": 0,
    "NumberCopyThreads": 1,
    "KeepNumberVersions": 5,
    "DaysToKeepVersion": 0,
    "FilenameContains": "",
    "FilenameWithout": "",
    "ReplaceInvalidTargetFilenameChars": false,
    "TotalDownloadSpeedMbPerSecond": 0.5,
    "StorageAccounts": [
      {
        "Name": "abc",
        "SasConnectionString": "BlobEndpoint=https://abc.blob.core.windows.net/;QueueEndpoint=https://abc.queue.core.windows.net/;FileEndpoint=https://abc.file.core.windows.net/;TableEndpoint=https://abc.table.core.windows.net/;SharedAccessSignature=sv=2019-12-12&ss=bfqt&srt=sco&sp=rl&se=2020-12-20T04:37:08Z&st=2020-12-19T20:37:08Z&spr=https&sig=abce3e399jdkjs30fjsdlkD",
        "FilenameContains": "",
        "FilenameWithout": "",
        "Containers": [
          {
            "Name": "test",
            "FilenameContains": "",
            "FilenameWithout": "",
            "Blobs": [
              {
                "Filename": "2007 EasyRadiology.pdf",
                "TargetFilename": "projects/radiology/Brochure3.pdf"
              }
            ]
          },
          {
            "Name": "test2"
          }
        ]
      },
      {
        "Name": "martintest3",
        "SasConnectionString": "",
        "Containers": []
      }
    ]
  }
}
Run the application with your job with:
blobtolocal job1.json
Without resorting to third-party solutions, you can achieve this using built-in Azure features; the steps below can help secure your blobs.
Soft delete for Azure Storage Blobs
The first step is to enable soft delete, which is now in GA:
https://azure.microsoft.com/en-us/blog/soft-delete-for-azure-storage-blobs-ga
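Once soft delete is enabled, deleted blobs can be listed and restored during the retention window. A minimal sketch, assuming the azure-storage-blob Python package and placeholder names:

    # Sketch: list soft-deleted blobs and undelete them within the retention window.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_connection_string(
        "<connection-string>", container_name="mycontainer")

    for blob in container.list_blobs(include=["deleted"]):
        if blob.deleted:
            container.get_blob_client(blob.name).undelete_blob()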
Read-access geo-redundant storage
The second approach is to enable read-access geo-redundant storage (RA-GRS), so if the primary data center is down you can always read from a secondary replica in another region. You can find more information here:
https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy-grs
You can make a snapshot of a blob container and then download the snapshot for a point-in-time backup.
https://learn.microsoft.com/en-us/azure/storage/storage-blob-snapshots
A snapshot is a read-only version of a blob that's taken at a point in
time. Snapshots are useful for backing up blobs. After you create a
snapshot, you can read, copy, or delete it, but you cannot modify it.
A snapshot of a blob is identical to its base blob, except that the
blob URI has a DateTime value appended to the blob URI to indicate the
time at which the snapshot was taken. For example, if a page blob URI
is http://storagesample.core.blob.windows.net/mydrives/myvhd, the
snapshot URI is similar to
http://storagesample.core.blob.windows.net/mydrives/myvhd?snapshot=2011-03-09T01:42:34.9360000Z.
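A small sketch of taking and reading back a snapshot with the azure-storage-blob Python package (my own illustration; connection string and names are placeholders):

    # Sketch: create a point-in-time snapshot and read its contents.
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        "<connection-string>", container_name="mydrives", blob_name="myvhd")

    snap = blob.create_snapshot()                 # returns e.g. {'snapshot': '2021-...'}

    snapshot_blob = BlobClient.from_connection_string(
        "<connection-string>", container_name="mydrives",
        blob_name="myvhd", snapshot=snap["snapshot"])
    data = snapshot_blob.download_blob().readall()  # point-in-time content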

Copying storage data from one Azure account to another

I would like to copy a very large storage container from one Azure storage account into another (which also happens to be in another subscription).
I would like an opinion on the following options:
Write a tool that would connect to both storage accounts and copy blobs one at a time using CloudBlob's DownloadToStream() and UploadFromStream(). This seems to be the worst option because it will incur costs when transferring the data and also be quite slow because data will have to come down to the machine running the tool and then get re-uploaded back to Azure.
Write a worker role to do the same - this should theoretically be faster and not incur any cost. However, this is more work.
Upload the tool to a running instance bypassing the worker role deployment and pray the tool finishes before the instance gets recycled/reset.
Use an existing tool - have not found anything interesting.
Any suggestions on the approach?
Update: I just found out that this functionality has finally been introduced (REST APIs only for now) for all storage accounts created on July 7th, 2012 or later:
http://msdn.microsoft.com/en-us/library/windowsazure/dd894037.aspx
You can also use AzCopy that is part of the Azure SDK.
Just click the download button for Windows Azure SDK and choose WindowsAzureStorageTools.msi from the list to download AzCopy.
After installing, you'll find AzCopy.exe here: %PROGRAMFILES(X86)%\Microsoft SDKs\Windows Azure\AzCopy
You can get more information on using AzCopy in this blog post: AzCopy – Using Cross Account Copy Blob
As well, you could remote desktop into an instance and use this utility for the transfer.
Update:
You can also copy blob data between storage accounts using Microsoft Azure Storage Explorer. Reference link
Since there's no direct way to migrate data from one storage account to another, you'd need to do something like what you were thinking. If this is within the same data center, option #2 is the best bet, and will be the fastest (especially if you use an XL instance, giving you more network bandwidth).
As far as complexity, it's no more difficult to create this code in a worker role than it would be with a local application. Just run this code from your worker role's Run() method.
To make things more robust, you could list the blobs in your containers, then place specific file-move request messages into an Azure queue (and optimize by putting more than one object name per message). Then use a worker role thread to read from the queue and process objects. Even if your role is recycled, at worst you'd reprocess one message. For performance increase, you could then scale to multiple worker role instances. Once the transfer is complete, you simply tear down the deployment.
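A rough sketch of that queue-driven pattern using today's Python SDKs (azure-storage-queue plus azure-storage-blob) instead of a worker role; the queue name, containers, and connection strings are placeholders:

    # Sketch: a worker loop that reads blob names from a queue and triggers
    # server-side copies to the destination account.
    from azure.storage.blob import BlobClient
    from azure.storage.queue import QueueClient

    queue = QueueClient.from_connection_string("<conn-string>", "copy-requests")

    for msg in queue.receive_messages():
        blob_name = msg.content                   # one blob name per message
        src_url = f"https://sourceaccount.blob.core.windows.net/data/{blob_name}?<sas>"
        dest = BlobClient.from_connection_string(
            "<dest-conn-string>", container_name="data", blob_name=blob_name)
        dest.start_copy_from_url(src_url)         # asynchronous, server-side copy
        queue.delete_message(msg)                 # at-least-once: reprocessing is safe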
UPDATE - On June 12, 2012, the Windows Azure Storage API was updated, and now allows cross-account blob copy. See this blog post for all the details.
Here is some code that leverages the .NET SDK for Azure, available at http://www.windowsazure.com/en-us/develop/net
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.WindowsAzure.StorageClient;
using System.IO;
using System.Net;

namespace benjguinAzureStorageTool
{
    class Program
    {
        private static Context context = new Context();

        static void Main(string[] args)
        {
            try
            {
                string usage = string.Format("Possible Usages:\n"
                    + "benjguinAzureStorageTool CopyContainer account1SourceContainer account2SourceContainer account1Name account1Key account2Name account2Key\n"
                    );

                if (args.Length < 1)
                    throw new ApplicationException(usage);
                int p = 1;

                switch (args[0])
                {
                    case "CopyContainer":
                        if (args.Length != 7) throw new ApplicationException(usage);
                        context.Storage1Container = args[p++];
                        context.Storage2Container = args[p++];
                        context.Storage1Name = args[p++];
                        context.Storage1Key = args[p++];
                        context.Storage2Name = args[p++];
                        context.Storage2Key = args[p++];

                        CopyContainer();
                        break;
                    default:
                        throw new ApplicationException(usage);
                }

                Console.BackgroundColor = ConsoleColor.Black;
                Console.ForegroundColor = ConsoleColor.Yellow;
                Console.WriteLine("OK");
                Console.ResetColor();
            }
            catch (Exception ex)
            {
                Console.WriteLine();
                Console.BackgroundColor = ConsoleColor.Black;
                Console.ForegroundColor = ConsoleColor.Yellow;
                Console.WriteLine("Exception: {0}", ex.Message);
                Console.ResetColor();
                Console.WriteLine("Details: {0}", ex);
            }
        }

        private static void CopyContainer()
        {
            CloudBlobContainer container1Reference = context.CloudBlobClient1.GetContainerReference(context.Storage1Container);
            CloudBlobContainer container2Reference = context.CloudBlobClient2.GetContainerReference(context.Storage2Container);

            if (container2Reference.CreateIfNotExist())
            {
                Console.WriteLine("Created destination container {0}. Permissions will also be copied.", context.Storage2Container);
                container2Reference.SetPermissions(container1Reference.GetPermissions());
            }
            else
            {
                Console.WriteLine("destination container {0} already exists. Permissions won't be changed.", context.Storage2Container);
            }

            foreach (var b in container1Reference.ListBlobs(
                new BlobRequestOptions(context.DefaultBlobRequestOptions)
                { UseFlatBlobListing = true, BlobListingDetails = BlobListingDetails.All }))
            {
                var sourceBlobReference = context.CloudBlobClient1.GetBlobReference(b.Uri.AbsoluteUri);
                var targetBlobReference = container2Reference.GetBlobReference(sourceBlobReference.Name);

                Console.WriteLine("Copying {0}\n to\n{1}",
                    sourceBlobReference.Uri.AbsoluteUri,
                    targetBlobReference.Uri.AbsoluteUri);

                using (Stream targetStream = targetBlobReference.OpenWrite(context.DefaultBlobRequestOptions))
                {
                    sourceBlobReference.DownloadToStream(targetStream, context.DefaultBlobRequestOptions);
                }
            }
        }
    }
}
It's very simple with AzCopy. Download the latest version from https://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/
and in AzCopy type:
Copy a blob within a storage account:
AzCopy /Source:https://myaccount.blob.core.windows.net/mycontainer1 /Dest:https://myaccount.blob.core.windows.net/mycontainer2 /SourceKey:key /DestKey:key /Pattern:abc.txt
Copy a blob across storage accounts:
AzCopy /Source:https://sourceaccount.blob.core.windows.net/mycontainer1 /Dest:https://destaccount.blob.core.windows.net/mycontainer2 /SourceKey:key1 /DestKey:key2 /Pattern:abc.txt
Copy a blob from the secondary region
If your storage account has read-access geo-redundant storage enabled, then you can copy data from the secondary region.
Copy a blob to the primary account from the secondary:
AzCopy /Source:https://myaccount1-secondary.blob.core.windows.net/mynewcontainer1 /Dest:https://myaccount2.blob.core.windows.net/mynewcontainer2 /SourceKey:key1 /DestKey:key2 /Pattern:abc.txt
I'm a Microsoft Technical Evangelist and I have developed a sample and free tool (no support/no guarantee) to help in these scenarios.
The binaries and source-code are available here: https://blobtransferutility.codeplex.com/
The Blob Transfer Utility is a GUI tool to upload and download thousands of small/large files to/from Windows Azure Blob Storage.
Features:
Create batches to upload/download
Set the Content-Type
Transfer files in parallel
Split large files in smaller parts that are transferred in parallel
The 1st and 3rd features are the answer to your problem.
You can learn from the sample code how I did it, or you can simply run the tool and do what you need to do.
Write your tool as a simple .NET Command Line or Win Forms application.
Create and deploy a dummy web/worker role with RDP enabled
Login to the machine via RDP
Copy your tool over the RDP connection
Run the tool on the remote machine
Delete the deployed role.
Like you, I am not aware of any off-the-shelf tools supporting a copy-between function.
You may like to consider just installing Cloud Storage Studio into the role though and dumping to disk then re-uploading. http://cerebrata.com/Products/CloudStorageStudiov2/Details.aspx?t1=0&t2=7
You could use 'Azure Storage Explorer' (free) or some other such tool. These tools provide a way to download and upload content. You will need to manually create containers and tables - and of course this will incur a transfer cost - but if you are short on time and your content is of reasonable size then this is a viable option.
I recommend using azcopy; you can copy the whole storage account, a container, a directory, or a single blob. Here is an example of cloning a whole storage account:
azcopy copy 'https://{SOURCE_ACCOUNT}.blob.core.windows.net{SOURCE_SAS_TOKEN}' 'https://{DESTINATION_ACCOUNT}.blob.core.windows.net{DESTINATION_SAS_TOKEN}' --recursive
You can get a SAS token from the Azure Portal. Navigate to the storage account overview (source and destination), then in the side nav click on "Shared access signature" and generate your own.
More examples here
I had to do something similar to move 600 GB of content from a local file system to Azure Storage. After a couple of iterations of code I finally took 'Azure Storage Explorer' and extended it with the ability to select folders instead of just files, recursively drill into the selected folders, and load a list of source/destination copy-item statements into an Azure queue. The upload section of 'Azure Storage Explorer' then pulls from the queue and executes the copy operation.
Then I launched about 10 instances of the 'Azure Storage Explorer' tool and had each pull from the queue and execute the copy operation. I was able to move the 600 GB of items in just over 2 days. I added smarts to use the modified timestamps on files so it skips files that have already been copied and doesn't add files to the queue if they are already in sync. Now I can run "updates" or syncs within an hour or two across the whole library of content.
Try CloudBerry Explorer. It copies blobs within and between subscriptions.
For copying between subscriptions, edit the storage account container's access from Private to Public Blob.
The copying process took a few hours to complete. If you choose to reboot your machine, the process will continue. Check the status by refreshing the target storage account container in the Azure management UI and checking the timestamp; the value keeps updating until the copy process completes.
