Here is the problem: I have devices pushing telemetry messages to Azure IoT Hub, and currently I save all messages to Table Storage with the device ID as partition key and the telemetry kind as row key. What I want to do is restrict the size of the stored data. For instance, the table should keep only up to 50 MB, and anything beyond that should be cleared. What kind of storage should I use for such a use case, and what are the benefits? Any suggestions are highly appreciated.
Neither Azure Tables nor Azure Blobs have a feature where content is automatically deleted once a certain size is reached. In fact, I don't think I have come across any cloud storage solution that offers it (I have seen data deleted automatically based on age, though).
Thus if you want to delete the data once it reaches a certain size, you will have to write some code and schedule it (using either Functions or WebJobs). That code will find the size occupied and delete the data going over the limit.
Between Blobs and Tables, I am somewhat conflicted. With blobs, it is much easier to work out the storage consumed: you just list the blobs in a container and sum up their sizes. With tables, you have to keep fetching entities (i.e. downloading the data) and calculate the size of that data yourself. On the other hand, deleting data from tables is easier, as you are simply deleting rows (unless you store each record in a separate blob).
If the requirement were based on data age rather than data size, I would have recommended Cosmos DB. It is more expensive than Azure Storage, but you can define a TTL at the collection level, and based on that policy documents are deleted automatically.
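As a rough sketch of what that scheduled cleanup could look like for the blob option (a Function or WebJob timer would run it), assuming the Python azure-storage-blob SDK; the connection string, container name and the 50 MB cap are placeholders:

```python
# Sketch only: prune a blob container back under a size cap.
from azure.storage.blob import BlobServiceClient

MAX_BYTES = 50 * 1024 * 1024  # 50 MB cap (placeholder)

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("telemetry")

# List all blobs once; each BlobProperties exposes size and last_modified.
blobs = list(container.list_blobs())
total = sum(b.size for b in blobs)

if total > MAX_BYTES:
    # Delete oldest blobs first until we are back under the cap.
    for b in sorted(blobs, key=lambda b: b.last_modified):
        container.delete_blob(b.name)
        total -= b.size
        if total <= MAX_BYTES:
            break
```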
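For illustration, this is roughly how such a TTL policy can be set when creating a container with the Python azure-cosmos SDK; the endpoint, key, database/container names and the one-day TTL are all placeholders:

```python
# Sketch: create a Cosmos DB container whose documents expire automatically.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("<endpoint>", credential="<key>")
database = client.create_database_if_not_exists("telemetry-db")

# default_ttl is in seconds; documents older than this are deleted automatically.
container = database.create_container_if_not_exists(
    id="telemetry",
    partition_key=PartitionKey(path="/deviceId"),
    default_ttl=24 * 60 * 60,  # 1 day (placeholder)
)
```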
There is a requirement to store XML data in some storage. As mentioned, it stores large XML data, with each record (row) close to 1MB in size. The question is which storage we should use to store the data: Azure Table Storage (Storage Account) or SQL Azure.
So which storage will help data store and retrieval faster?
When looking at sheer volume, Table Storage is today far more scalable than SQL Azure. Given a storage account (storage accounts hold blobs, queues and tables) is allowed to be 100TB in size, in theory your table could consume all 100TB. At first glance, a 100TB chunk of data may seem overwhelming. However, Table Storage can be partitioned. Each partition of Table Storage can be moved to a separate server by the Azure controller thereby reducing the load on any single server. As demand lessens, the partitions can be reconsolidated. Reads of Azure Table Storage are load balanced across three replicas to help performance.
Entities in Table Storage are limited to 1MB each with no more than 255 properties (3 of which are required: partition key, row key, and timestamp).
Today, SQL Azure databases are limited to 1GB or 10GB. However, sometime this month (June 2010), a 50GB limit is supposed to become available. What happens if your database is larger than 10GB today (or 50GB tomorrow)? Options include repartitioning your database into multiple smaller databases or sharding (Microsoft's generally recommended approach). Without getting into the database design details of either pattern, both approaches come with issues and complexity, some of which must be resolved at the application level.
It's hard to say that Azure Table Storage data retrieval must be faster than SQL Azure. It depends on your data structure and size.
As you said, each record (row) of your XML data is nearly 1MB in size; if it does not exceed the 1MB limit, you can consider Table Storage first.
You can refer to this document for a more detailed comparison of Azure Table Storage and SQL Azure: Azure Table Storage vs. Windows SQL Azure
Hope this helps.
I want to store and retrieve some JSON data from a file of up to 1MB in size. Should I use Azure Table Storage or Blob Storage?
An entity in Table Storage (equivalent to a row in a table in an RDBMS) can be up to 1MB, however individual attributes in an entity (equivalent to columns) can only be 64KB. You can spread your JSON over multiple attributes, but this only works if you can guarantee that every file stays well below 1MB. (You will also need some room for your system attributes like PartitionKey, RowKey, etc.)
I would suggest looking into another store: DocumentDB, MongoDB, or perhaps even a Redis cache that you back with other non-volatile storage. Maybe an Azure SQL DB will suffice, now that it has support for retrieving JSON values.
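As a hedged sketch of that "spread over multiple attributes" idea, assuming the Python azure-data-tables SDK and a table that already exists (all names and keys are made up):

```python
# Sketch: spread a JSON document over several string properties of one entity.
import json
from azure.data.tables import TableClient

CHUNK_CHARS = 32_000  # string properties max out at 64KB (UTF-16), so ~32K chars

table = TableClient.from_connection_string("<connection-string>", table_name="documents")

payload = json.dumps({"example": "your JSON document here"})
chunks = [payload[i:i + CHUNK_CHARS] for i in range(0, len(payload), CHUNK_CHARS)]

entity = {"PartitionKey": "docs", "RowKey": "doc-001", "ChunkCount": len(chunks)}
for index, chunk in enumerate(chunks):
    entity[f"Json{index}"] = chunk  # Json0, Json1, ... hold the pieces

table.create_entity(entity)

# Reading it back: concatenate the chunks in order.
stored = table.get_entity(partition_key="docs", row_key="doc-001")
restored = json.loads("".join(stored[f"Json{i}"] for i in range(stored["ChunkCount"])))
```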
Another solution would be saving the files in BLOB storage and referencing them from the table storage. If you would need to look up multiple files at once, this might be slower though.
+1 for the solution of storing the data in blob storage and referencing the blob URI in the table. What you can also do is update the blob metadata properties with the unique identifiers of the table entity, so that even if you only retrieve the blobs you can tell which entity each one belongs to.
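Roughly, that combined pattern could look like the following sketch (Python azure-storage-blob and azure-data-tables SDKs assumed; every name, key and connection string is a placeholder):

```python
# Sketch: store the JSON as a blob, keep a pointer in table storage,
# and tag the blob metadata with the table keys it belongs to.
import json
from azure.data.tables import TableClient
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<connection-string>")
table = TableClient.from_connection_string("<connection-string>", table_name="documents")

partition_key, row_key = "docs", "doc-001"
payload = json.dumps({"example": "your JSON document here"})

# Upload the JSON and tag it with the table keys it belongs to.
blob = blob_service.get_blob_client(container="json-docs", blob=f"{row_key}.json")
blob.upload_blob(
    payload,
    overwrite=True,
    metadata={"PartitionKey": partition_key, "RowKey": row_key},
)

# Reference the blob from the table entity.
table.create_entity({
    "PartitionKey": partition_key,
    "RowKey": row_key,
    "BlobUri": blob.url,
})
```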
In our service, we are using SQL Azure as the main storage and Azure Table as the backup storage. Every day about 30GB of data is collected and stored in SQL Azure. Since the data is no longer valid from the next day, we want to migrate it from SQL Azure to Azure Table every night.
The question is: what would be the most efficient way to migrate data from SQL Azure to Azure Table?
The naive idea I came up with is to leverage the producer/consumer concept using an IDataReader. That is, first get a data reader by executing "select * from TABLE" and put the data into a queue. At the same time, a set of threads grab data from the queue and insert it into Azure Table.
Of course, the main disadvantage of this approach (I think) is that we need to keep the connection open for a long time (possibly several hours).
Another approach is to first copy the data from the SQL Azure table to local storage on Windows Azure and use the same producer/consumer concept. With this approach we can disconnect as soon as the copy is done.
At this point, I'm not sure which one is better, or even whether either of them is a good design to implement. Could you suggest a good design solution for this problem?
Thanks!
I would not recommend using local storage, primarily because:
It is transient storage.
You're limited by the size of local storage (which in turn depends on the size of the VM).
Local storage is local only i.e. it is accessible only to the VM in which it is created thus preventing you from scaling out your solution.
I like the idea of using queues; however, I see some issues there as well:
Assuming you're planning on storing each row in a queue as a message, you would be performing a lot of storage transactions. If we assume your row size is 64KB, storing 30GB of data would take about 500,000 write transactions (and similarly 500,000 read transactions) - I hope I got my math right :) (the arithmetic is spelled out just after this list). Even though storage transactions are cheap, I still think you'll be doing so many transactions that the entire process would slow down.
Since you're doing so many transactions, you may run into the storage scalability/throttling thresholds. You may want to look into that.
Yet another limitation is the maximum size of a message. Currently a maximum of 64KB of data can be stored in a single message. What would happen if your row size is more than that?
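For reference, the arithmetic behind that first point: 30 GB ≈ 30 × 1024 × 1024 KB = 31,457,280 KB; divided by 64 KB per message, that is 491,520 messages, i.e. roughly 500,000 write transactions (and the same again to read them back).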
I would actually recommend throwing blob storage into the mix. What you could do is read a chunk of data from the SQL table (say 10,000 or 100,000 records) and save that data in blob storage as a file. Depending on how you want to put the data into table storage, you could store it in CSV, JSON or XML format (XML if you need to preserve data types). Once the file is written to blob storage, you would write a message to the queue containing the URI of the blob you've just written. Your worker role (processor) would continuously poll this queue, get one message, fetch the file from blob storage and process that file. Once the worker role has processed the file, you can simply delete that file and the message.
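As a rough sketch of the producer side of that pipeline (assuming pyodbc plus the azure-storage-blob and azure-storage-queue Python SDKs; the query, batch size, connection strings and names are all placeholders):

```python
# Sketch: export SQL rows in chunks to blob storage and enqueue a pointer
# for each chunk so a worker can process them independently.
import csv
import io

import pyodbc
from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

BATCH_SIZE = 100_000  # placeholder chunk size

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = blob_service.get_container_client("export-chunks")
queue = QueueClient.from_connection_string("<storage-connection-string>", "export-chunks")

sql = pyodbc.connect("<sql-azure-connection-string>")
cursor = sql.cursor()
cursor.execute("SELECT * FROM Telemetry")  # placeholder query
columns = [c[0] for c in cursor.description]

chunk_number = 0
while True:
    rows = cursor.fetchmany(BATCH_SIZE)
    if not rows:
        break

    # Write this chunk as CSV into a blob.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)
    writer.writerows(rows)

    blob_name = f"chunk-{chunk_number:05d}.csv"
    container.upload_blob(blob_name, buffer.getvalue(), overwrite=True)

    # Tell the worker which blob to process.
    queue.send_message(container.get_blob_client(blob_name).url)
    chunk_number += 1
```

The worker side would then poll the queue, download the blob named in each message, write the records to table storage, and delete both the blob and the message once it is done.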
I need to do an automatic periodic backup of an Azure blob storage to another Azure blob storage.
This is in order to guard against any kind of malfunction in the software.
Are there any services which do that? Azure doesn't seem to have this built in.
As @Brent mentioned in the comments to Roberto's answer, the replicas are for HA; if you delete a blob, that delete is replicated instantly.
For blobs, you can very easily create asynchronous copies to a separate blob (even in a separate storage account). You can also take snapshots, which capture a blob at a moment in time. Initially a snapshot doesn't cost anything, but as you modify the blocks/pages referred to by the snapshot, new blocks/pages are allocated. Over time you'll want to start purging your snapshots. This is a great way to keep data as-is over time and revert back to a snapshot if there's a malfunction in your software.
With queues, the malfunction story isn't quite the same, as typically you'd only have a small number of queue items present (at least that's the hope; if you have thousands of queue messages, it's typically a sign that your software is falling behind). In any event: when writing queue messages, you could also write them to blob storage for archival purposes, in case there's a malfunction. I wouldn't recommend using blob-based messaging for scaling/parallel processing, since blobs don't have the mechanisms that queues do, but you could use them manually in case of a malfunction.
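A minimal sketch of both operations with the Python azure-storage-blob SDK (accounts, containers and blob names are placeholders; the cross-account copy additionally assumes the source blob is readable, e.g. via a SAS appended to its URL):

```python
# Sketch: copy a blob to a separate backup account and take a snapshot.
from azure.storage.blob import BlobServiceClient

source = BlobServiceClient.from_connection_string("<primary-connection-string>")
backup = BlobServiceClient.from_connection_string("<backup-connection-string>")

source_blob = source.get_blob_client(container="data", blob="report.csv")
backup_blob = backup.get_blob_client(container="backups", blob="report.csv")

# Server-side asynchronous copy into the backup account.
backup_blob.start_copy_from_url(source_blob.url)

# Point-in-time snapshot of the source blob.
snapshot = source_blob.create_snapshot()
```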
There's no copy function for tables. You'd need to write to two tables during your write.
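In other words, something along these lines (a sketch with the Python azure-data-tables SDK; table names and the sample entity are made up):

```python
# Sketch: "backup" for table storage means writing each entity twice,
# since there is no built-in table copy.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
primary = service.get_table_client("telemetry")
backup = service.get_table_client("telemetrybackup")

entity = {"PartitionKey": "device-01", "RowKey": "2024-01-01T00:00:00Z", "Temp": 21.5}

# Every write goes to both tables.
primary.create_entity(entity)
backup.create_entity(entity)
```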
Azure keeps 3 redundant copies of your data in different locations in the same data centre where your data is hosted (to guard against hardware failure).
This applies to blob, table and queue storage.
Additionally, you can enable geo-replication on all of your storage. Azure will automatically keep redundant copies of your data in separate data centres. This guards against anything happening to the data centre itself.
See Here
Say I need to retrieve 20 thumbnail images from Azure blob storage after a button click. I've read that blobs are accessed like so: http://<storage account>.blob.core.windows.net/<container>/<blob>
A single GetBlob() request is charged at 1 transaction. Is this to say getting 20 images will cost, at a minimum, 20 transactions?
Is there a way to send a batch request such that it retrieves those images and is billed at 1 transaction?
I've read about Entity Group Transactions, but it sounded to me they are for Azure Table only.
There's nothing akin to Entity Group Transactions with blobs. Each is accessed individually, burning at least one transaction (depending on blob size).
At a penny per 100,000 transactions, this will likely not be a major cost factor unless you're constantly downloading blobs. In that case, it might be worth considering some type of cache to prevent excessive activity against blob storage.
One other workaround (hack?): if you're always grabbing a set of related blobs, you could store that related collection in a zip file, in a single blob. Not saying I'm in favor of this, but if you need to save transactions, at least it's an option (aside from caching).
Take a look at this MSDN article, which describes storage and how partitions relate to blobs and tables (scroll down to the Partitions section). The pertinent info for you: each blob is in its own partition. With table storage, you're able to perform atomic actions on entities within a single partition (there are no atomic actions across multiple partitions). This is why you don't see atomic operations across multiple blobs.
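If you did go that route, here is a hedged sketch using Python's zipfile and the azure-storage-blob SDK (the thumbnail file names, container and blob names are made up):

```python
# Sketch: bundle a set of related thumbnails into one zip blob so the client
# fetches them all with a single GetBlob request.
import io
import zipfile

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("thumbnails")

# Bundle: zip 20 local thumbnails (placeholders) and upload as one blob.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    for i in range(20):
        archive.write(f"thumb-{i}.png")
container.upload_blob("gallery-001.zip", buffer.getvalue(), overwrite=True)

# Fetch: one transaction retrieves the whole set; unzip in memory.
data = container.download_blob("gallery-001.zip").readall()
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    thumbnails = {name: archive.read(name) for name in archive.namelist()}
```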