Azure table storage with large entity sizes - azure

A couple of questions that I can't find any answers to. Hope someone can help:
I will be using entity sizes of almost 1MB. I can't find any information on read latency for these large entity sizes. Is there anyone out there that has any information on this.
Is there any way to determine how much space is used for a row in Azure table storage. Any API for this
Thanks

If most of our entities are large, then it somehow defeats the purpose of Table Storage in the first place. Indeed, you will only be able to retrieve update them 4 by 4 max, as entity transactions are limited to 4MB.
You can get many useful measurements on the AzureScope project concerning the Table Storage, but also the other storage services of Azure.
Then, if you want to accurately check the weight of your rows, just use Fiddler to intercept your web requests, and directly look at the XML being produced.

Related

CosmosDB: Efficiently migrate records from a large container

I created a container in CosmosDB that tracks metadata about each API call (timestamp, user id, method name, duration, etc.). The partition key is set to UserId and each id is a random Guid. This container also helps me enforce rate limiting for each user. So far so good. Now, I want to periodically clean up this container by moving records to an Azure Table (or something else) for long-term storage and generate reporting. Migrating records also helps me avoid the 20GB logical partition size limit.
However, I have concerns about whether cross-partition queries will bite me eventually. Say, I want to migrate all records that were created a week ago. Also, let's assume I have millions of active users. Thus, this container sees a lot of activity and I can't specify a partition key in my query. I'm reading that we should avoid cross-partition queries when RU/s and storage size are both big. See this. I have no idea how many physical partitions I'm going to end up dealing with in the future.
Is my design completely off? How can I efficiently migrate records? I'm hoping that the CosmosDB team can see this and help me find a solution to this problem.
The easier approach would be to use a time to live and just write events\data to both cosmos db and table storage at the same time, so that it stays in table storage forever, but is gone from Cosmos DB when TTL expires. You can specify TTL at document level, so if you need some documents to live longer - that can be done.
Another approach might be using the change feed.
Based on your updated comments:
You are writing a CosmosDb doc for each API request.
When an API call is made, you are querying CosmosDB for all API calls within a given time period with the partition being the userId. If the document count exceeds the threshold, return an error such as a HTTP 429.
You want to store API call information for longterm analysis.
If your API is getting a lot of use from a lot of users, using CosmosDB is going to be expensive to scale, both from a storage and a processing standpoint.
For rate limiting, consider this rate limiting pattern using Redis cache. The StackExchange.Redis package is mature, and has lots of guidance and code samples. It'll be a much lighter weight and scalable solution to your problem.
So for each API call, you would:
Read the Redis key for the user making the call. Check to see if it exceeds your threshold.
Increment the user's Redis key.
Write the API invocation into to Azure Table Storage, probably with the partition key being the userId, and the rowkey being whatever makes sense for you.

How efficient can Azure BLOB Table service can be?

How efficient azure blob tables can be?
Azure BLOB service has various components like Containers, Queues and Tables too. How efficient can tables be, what is their exact use case and why are they generally used with a supporting service like Azure CosmoDB.
Can anyone help me understand the concept and thought behind it?
Edit: The problem I am facing is that I have to log a processing batch of 700 000 data rows in C#, into BLOB Tables. How do I achieve this in the best practices?
This is a three in one question :-)
How efficient can tables be
Very efficient, if used properly. Every row in a table has a PartitionKey and Rowkey. When querying data it performs very well if you can reduce the set by using (parts of) the PartitionKey and RowKey. As soon as you start filtering on other columns performance can decrease very fast. See also the docs regarding this topic.
what is their exact use case
It is basically a key/value pair nosql solution. It can be used very efficient to store simple data in a fast and cheap manner. It is one of the cheapest options when it comes to data storage. Tables don't have a fixed schema (hence, nosql) and is used to store for example logs, configuration data and simple data structures.
and why are they generally used with a supporting service like Azure CosmosDB.
This is not the case. Azure Table Storage can be used on its own. CosmosDB has a Table API that lets you make uses of CosmosDB against code written for Azure Table Storage without code modifications. It allows for premium performance as not only the PartitionKey and Rowkey are indexed, but all the other columns as well. So as soon as you start filtering on other columns performance will still be very good. But it will costs you more in terms of money.
Data storage could be best done using batches as data is written per partition. See the answer of Ivan.
Some more material on when to use it:
https://markheath.net/post/azure-tables-what-are-they-good-for
https://blogs.msdn.microsoft.com/brunoterkaly/2013/01/13/knowing-when-to-choose-windows-azure-table-storage-or-windows-azure-sql-database/

Best practice - Storage options for external reference data that is queried in different ways

We have a cloud platform with various Health Care applications. Each application needs what we call reference data. Reference data is always external data coming from a provider on a daily or some regular schedule. An example of reference data is FDB MedKnowledge which includes a comprehensive compendium of consumer medication monographs, along with drug images and imprints.
Various applications will query the reference data to present it to their target customers (who can be physicians, nurses, technicians, procurement department etc...). A common global API will be developed to return the requested data.
Historical information is required ( for ex: FDB in 2017 had NDC1 which then got deleted from the FDB feed in 2019. So a physician who prescribed NDC1 should be able to query the information of that drug going through history).
Daily we receive the feed from the external provider and use it as input source to merge ( update, insert, delete) our reference data copy such that its live table reflects the latest external feed.
In Azure, we have the following storage options:
Blob storage
Cosmos Db
Azure sql database with system versioning
Azure Datawarehouse
Azure Data lake
What is the best practice to store external reference data? We are leaning toward azure sql database with system versioning. Have any of you worked with external reference data? If yes, what is your storage decision and has it worked well for you? I would like to hear your comments and opinions. Thank you!
You need to base your choice on the type of data you are trying to store, and how you need to reference it. It sounds like you might actually need a few different technologies here.
For example, Azure SQL is great for storing relational data. So if your data is tabular in form and needs to have relationships between it, then this is a good choice. However, if you're going to be storing millions and millions of rows then performance might suffer in a relational database. In that sort of scenario, or one where you have lots of transactional data you might want to look at Cosmos DB.
You mentioned images at one point, putting these in a database is not a good idea, in this sort of scenario you are going to want to look at using blob storage.
"Reference Data" really doesn't mean anything, look at the individual types of data you need to store, and how this data is used, and make decisions based on this. For lots of different types of data, there is unlikely to be a one size fits all solution.

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of the Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awful highly limited alternative - you need to structure all your data in a two-step hierarchy: PartitionKey and RowKey. Provided you do that (which is nigh well impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection which allows you scale-out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage, and an variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution; and is also the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still the 10GB limit per collection but in the past, it was up to you to figure out how to split up your data into multiple collections to avoid hitting the 10 GB limit.
Instead, you can now, specify a partition key and DocumentDB now handles the partitioning for you e.g. If you have log data, you may want to partition the data on the date value in your JSON document, so that each day a new partition is created.
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html

ColumnStore index benefits on Azure?

We are currently running on Azure and we have a table with hundreds of millions of rows. This table is static and will be refreshed weekly. We've looked at ColumnStore index but unfortunately it is not Azure yet so below are my questions,
Will ColumnStore index be available in Azure?
if not what other technology can we use to get the same performance
benefits as the ColumnStore index would provide?
Can we get the same query performance by using Azure Table Storage?
I'm a newbie to both Azure and Columnar databases so please bear me with me if I ask these questions.. :)
About ColumnStore, if you have bought the license, you can check with development team or ask on blogs such as ScottGu's Blog. From there only you will come to know about any feature release.
Azure Database is designed for scalability. You will need to use the Partition Key very wisely. Partition Key is like index of book, so if you want to search something in book, you can quickly refer to the index and reach the page quickly. In other words, you can group data depending upon certain criteria and store it in a single partition. So where ever you have the same criteria, your query will hit only one partition. The thing with partitions is, for a table you can any number of partition, but it is not necessary that all the partition will reside on same machine or even same farm. So when you fire a query on badly designed Azure Table, it can hit more than one server, and thus bad performance. Read about Real World: Designing a Scalable Partitioning Strategy for Windows Azure Table Storage
Hope you get what you are looking for.
As Amar pointed out, keep an eye on the team blogs for the latest in new feature announcements. The goal for SQL Azure is for it to eventually be where new features are found first. However, it will still take awhile for things to get there.
As for your peformance question, there's no simple answer for this. Windows Azure resources are designed for scale, not necessarially high performance. So its to take your scale/capacity targets into account when designing solutions. For your situation, I would encourage you to conside table storage, but this will depend on frequency access and the types of queries you need to make on the data. Just do not be surprised if you have to mave redundant copies of your data that are modelled differently, or possibly even running parrallel queries and aggregating results. This is the way table storage was designed to be used. Its cheaper then SQL Azure and its this price difference that makes redundant specialized data models possible.
This approach also has to be weighed against the cost of retraining your developers to stop thinking in RDBMS terms. :)

Resources