CosmosDB: Efficiently migrate records from a large container - azure

I created a container in CosmosDB that tracks metadata about each API call (timestamp, user id, method name, duration, etc.). The partition key is set to UserId and each id is a random Guid. This container also helps me enforce rate limiting for each user. So far so good. Now, I want to periodically clean up this container by moving records to an Azure Table (or something else) for long-term storage and reporting. Migrating records also helps me avoid the 20GB logical partition size limit.
However, I have concerns about whether cross-partition queries will bite me eventually. Say, I want to migrate all records that were created a week ago. Also, let's assume I have millions of active users. Thus, this container sees a lot of activity and I can't specify a partition key in my query. I'm reading that we should avoid cross-partition queries when RU/s and storage size are both big. See this. I have no idea how many physical partitions I'm going to end up dealing with in the future.
Is my design completely off? How can I efficiently migrate records? I'm hoping that the CosmosDB team can see this and help me find a solution to this problem.

The easier approach would be to use time to live (TTL) and write events/data to both Cosmos DB and Table Storage at the same time, so that the data stays in Table Storage forever but is removed from Cosmos DB when the TTL expires. You can specify TTL at the document level, so if you need some documents to live longer, that can be done.
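A minimal sketch of the dual-write idea, not a definitive implementation. The `ttl` property (in seconds) is the per-item TTL setting in Cosmos DB, and it only takes effect when TTL is enabled on the container. The seven-day retention, the field names other than `id`, `UserId` and `ttl`, and the `cosmos_upsert` / `table_insert` callables are assumptions made for illustration; they stand in for whatever SDK calls you actually use.

```python
import uuid
from datetime import datetime, timezone

SEVEN_DAYS = 7 * 24 * 60 * 60  # TTL in seconds; the retention period is an assumption


def build_api_call_doc(user_id, method, duration_ms, ttl_seconds=SEVEN_DAYS):
    """Build the Cosmos DB document for one API call, with a per-item TTL."""
    return {
        "id": str(uuid.uuid4()),
        "UserId": user_id,  # partition key, as in the question
        "method": method,
        "durationMs": duration_ms,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "ttl": ttl_seconds,  # seconds after the last write before the item expires
    }


def record_api_call(doc, cosmos_upsert, table_insert):
    """Write the same event to both stores.

    cosmos_upsert and table_insert are hypothetical callables wrapping the
    real SDK calls. Only the Cosmos DB copy carries the ttl field.
    """
    cosmos_upsert(doc)
    archive = {k: v for k, v in doc.items() if k != "ttl"}
    table_insert(archive)
```

With this shape there is no migration job and no cross-partition query: the Cosmos DB copy expires on its own, and the Table Storage copy is the long-term record.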
Another approach might be using the change feed.

Based on your updated comments:
You are writing a CosmosDB doc for each API request.
When an API call is made, you are querying CosmosDB for all API calls within a given time period with the partition being the userId. If the document count exceeds the threshold, return an error such as an HTTP 429.
You want to store API call information for long-term analysis.
If your API is getting a lot of use from a lot of users, using CosmosDB is going to be expensive to scale, both from a storage and a processing standpoint.
For rate limiting, consider this rate limiting pattern using Redis cache. The StackExchange.Redis package is mature, and has lots of guidance and code samples. It'll be a much more lightweight and scalable solution to your problem.
So for each API call, you would:
Read the Redis key for the user making the call. Check to see if it exceeds your threshold.
Increment the user's Redis key.
Write the API invocation to Azure Table Storage, probably with the partition key being the userId, and the RowKey being whatever makes sense for you.
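The read / check / increment steps above can be sketched as a fixed-window counter. This is only an illustration under stated assumptions: `FakeCounterStore` is an in-memory stand-in for Redis (it mimics a counter that expires after a window), and the `rate:<userId>` key format, the limit, and the window length are all hypothetical. Against a real Redis you would use its counter and expiry commands instead of this class.

```python
import time


class FakeCounterStore:
    """In-memory stand-in for Redis counters that expire after a window."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (count, expires_at)

    def get(self, key):
        count, expires_at = self._data.get(key, (0, 0.0))
        # An expired or missing key reads as zero.
        return count if self._clock() < expires_at else 0

    def incr(self, key, window_seconds):
        count = self.get(key)
        if count == 0:
            # First hit in this window: start a new expiry.
            expires_at = self._clock() + window_seconds
        else:
            expires_at = self._data[key][1]
        self._data[key] = (count + 1, expires_at)
        return count + 1


def allow_request(store, user_id, limit=100, window_seconds=60):
    """Return True if the call may proceed, False if it should get HTTP 429."""
    key = f"rate:{user_id}"  # hypothetical key naming scheme
    if store.get(key) >= limit:
        return False
    store.incr(key, window_seconds)
    return True
```

Note that reading and incrementing as two separate steps is not atomic, so concurrent calls can slightly overshoot the limit; for rate limiting that is often acceptable.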

Related

How efficient can the Azure BLOB Table service be?

How efficient can Azure BLOB tables be?
Azure BLOB service has various components like Containers, Queues and Tables too. How efficient can tables be, what is their exact use case, and why are they generally used with a supporting service like Azure CosmosDB?
Can anyone help me understand the concept and thought behind it?
Edit: The problem I am facing is that I have to log a processing batch of 700 000 data rows in C#, into BLOB Tables. How do I do this following best practices?
This is a three in one question :-)
How efficient can tables be
Very efficient, if used properly. Every row in a table has a PartitionKey and a RowKey. Queries perform very well if you can narrow the result set using (parts of) the PartitionKey and RowKey. As soon as you start filtering on other columns, performance can decrease very fast. See also the docs regarding this topic.
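To make the difference concrete, here is a small sketch that builds the OData filter strings Table Storage queries use. The helper names and the `DurationMs` column are hypothetical; only `PartitionKey`, `RowKey` and the `eq` / `ge` / `lt` / `gt` operators come from the service itself.

```python
def odata_quote(value):
    """Quote a string literal for an OData filter (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


def point_filter(partition_key, row_key):
    """Both keys given: the fastest kind of lookup, at most one entity."""
    return (f"PartitionKey eq {odata_quote(partition_key)} "
            f"and RowKey eq {odata_quote(row_key)}")


def row_range_filter(partition_key, row_from, row_to):
    """One partition plus a RowKey range: still served from the key index."""
    return (f"PartitionKey eq {odata_quote(partition_key)} "
            f"and RowKey ge {odata_quote(row_from)} "
            f"and RowKey lt {odata_quote(row_to)}")


# No key in the filter: the service has to scan entities, which is the slow case.
SCAN_FILTER = "DurationMs gt 500"
```

Designing the keys so that your common queries look like the first two shapes is most of what "used properly" means here.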
what is their exact use case
It is basically a key/value pair NoSQL solution. It can be used very efficiently to store simple data in a fast and cheap manner. It is one of the cheapest options when it comes to data storage. Tables don't have a fixed schema (hence, NoSQL) and are used to store, for example, logs, configuration data and simple data structures.
and why are they generally used with a supporting service like Azure CosmosDB.
This is not the case. Azure Table Storage can be used on its own. CosmosDB has a Table API that lets you run code written for Azure Table Storage against CosmosDB without code modifications. It allows for premium performance, as not only the PartitionKey and RowKey are indexed but all the other columns as well. So when you start filtering on other columns, performance will still be very good. But it will cost you more in terms of money.
Writing the data is best done in batches, as data is written per partition. See the answer of Ivan.
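As a rough sketch of what batching per partition means for the 700 000 rows in the question: a Table Storage batch may only contain entities from a single partition and at most 100 of them (with a 4 MB payload cap), so the rows have to be grouped by PartitionKey and then chunked. The function below does only that grouping; the row shape (a dict with a `PartitionKey` entry) is an assumption, and submitting each batch is left to whichever SDK you use.

```python
from collections import defaultdict

MAX_BATCH_SIZE = 100  # Table Storage limit: entities per batch, all in one partition


def make_batches(rows, batch_size=MAX_BATCH_SIZE):
    """Group rows by PartitionKey and yield lists of at most batch_size rows."""
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[row["PartitionKey"]].append(row)
    for partition_rows in by_partition.values():
        for start in range(0, len(partition_rows), batch_size):
            yield partition_rows[start:start + batch_size]
```

Each yielded list can then be sent as one batch operation, which cuts the number of round trips by up to a factor of 100 compared with inserting row by row.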
Some more material on when to use it:
https://markheath.net/post/azure-tables-what-are-they-good-for
https://blogs.msdn.microsoft.com/brunoterkaly/2013/01/13/knowing-when-to-choose-windows-azure-table-storage-or-windows-azure-sql-database/

Azure: Redis vs Table Storage for web cache

We currently use Redis as our persistent cache for our web application, but with its limited memory and cost I'm starting to consider whether Table storage is a viable option.
The data we store is fairly basic JSON data with a clear two-part key, which we'd use for the partition and row key in table storage, so I'm hoping that would mean fast querying.
I appreciate one is in memory and one is not, so table storage will be a bit slower. But as we scale, I believe there is only one CPU serving data from a Redis cache, whereas with Table storage we wouldn't have that issue, as it would be down to the number of web servers we have running.
Does anyone have any experience of using Table storage in this way, or comparisons between the two?
I should add that we use Redis in a very minimalist way: get/set and nothing more. We evict our own data and, failing that, leave the eviction to Redis when it runs out of space.
This is a fairly broad/opinion-soliciting question. But from an objective perspective, these are the attributes you'll want to consider when deciding which to use:
Table Storage is a durable, key/value store. As such, content doesn't expire. You'll be responsible for clearing out data.
Table Storage scales to 500TB.
Redis is scalable horizontally across multiple nodes (or, scalable via Redis Service). In contrast, Table Storage will provide up to 2,000 transactions / sec on a partition, 20,000 transactions / sec across the storage account, and to scale beyond, you'd need to utilize multiple storage accounts.
Table Storage will have a significantly lower cost footprint than a VM or Redis service.
Redis provides features beyond Azure Storage tables (such as pub/sub, content eviction, etc).
Both Table Storage and Redis Cache are accessible via an endpoint, with many language-specific SDK wrappers around the APIs.
I found some materials about Azure Redis and Table storage; I hope they can help you. There is a video about Azure Redis that also includes a demo comparing table storage and Redis, starting at about the 50th minute.
Perhaps it can serve as a reference. But detailed performance depends on your application, data records and so on.
The pricing of table storage depends on the capacity used; please refer to the details. It is much cheaper than Redis.
There are many differences you might care about, including price, performance, feature set, persistence of data, and data consistency.
Because Redis is an in-memory data store, it is pretty expensive. That is the price of low latency. Check out Azure's planning FAQ here for a general understanding of Redis performance in a throughput sense.
Azure Redis planning FAQ
Redis does have an optional persistence feature, that you can turn on, if you want your data persisted and restored when the servers have rare downtime. But it doesn't have a strong consistency guarantee.
Azure Table Storage is not a caching solution. It’s a persistent storage solution, and saves the data permanently on some kind of disk. Historically (disclaimer: I have not looked for the latest and greatest performance numbers) it has much higher read and write latency. It is also strictly a key-value store model (with two-part keys). Values can have properties, but with many strict limitations around the size of objects you can store, length of properties, and so on. These limitations are inflexible and painful if your app runs up against them.
Redis has a larger feature set. It can do key-value but also has a bunch of other data structures like sets and lists, and many apps can find ways to benefit from that added flexibility.
See 'Introduction to Redis' (Redis docs).
CosmosDB could be yet another alternative to consider if you're leaning primarily towards Azure technologies. It is pretty expensive, but quite fast and feature-rich, while also being primarily intended to be a persistent store.

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awful, highly limited alternative - you need to structure all your data in a two-step hierarchy: PartitionKey and RowKey. Provided you do that (which is well-nigh impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection, which allows you to scale out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage, and a variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution, and is also the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still the 10GB limit per collection but in the past, it was up to you to figure out how to split up your data into multiple collections to avoid hitting the 10 GB limit.
Instead, you can now specify a partition key, and DocumentDB handles the partitioning for you. For example, if you have log data, you may want to partition the data on the date value in your JSON document, so that each day a new partition is created.
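A small sketch of the date-based partitioning idea, under stated assumptions: the document carries an ISO 8601 `timestamp` with a UTC offset, and the collection would be created with a partition key path pointing at a day-granularity property. The property name `logDate` is hypothetical, chosen only for this example.

```python
from datetime import datetime, timezone


def with_partition_key(log_doc):
    """Return a copy of log_doc with a 'logDate' property (YYYY-MM-DD, in UTC).

    Assumes log_doc["timestamp"] is an ISO 8601 string that includes an offset.
    """
    ts = datetime.fromisoformat(log_doc["timestamp"])
    day = ts.astimezone(timezone.utc).date().isoformat()
    return {**log_doc, "logDate": day}
```

Every document written on the same UTC day then shares one partition key value, so a query for a single day touches a single partition rather than fanning out.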
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html

How to optimize transactions costs from testing existence of keys?

I'm designing an application using Azure Storage Blobs/Table/Queue, handling massive amount of data.
One important aspect of the application is that work will be done if a given key doesn't exist, and determining the existence of a key is a frequent and intensive task.
I need to reduce the billable transactions from key existence checks as much as possible.
It could be either against blobs or tables.
I looked at this document: Understanding Windows Azure Storage Billing – Bandwidth, Transactions, and Capacity. It seems that 404 errors go uncounted only for anonymous requests.
I was also thinking of using a BatchTableOperation to check 100 keys at once, maybe using a Replace or Merge, and determining from the results whether the key indeed existed (haven't tried it; actually I got the idea while writing).
Any good hacks are welcome.
You should use Windows Azure Caching:
Load all existing keys in the cache
Each time you add a record to Table Storage, also add it to cache
Once you've done that, have your application check cache first. If the item is not present there, check Table Storage just to be sure (to cover edge cases). But 99% of the time, if the item has already been processed the key will be available in the cache and you won't need to query Table Storage (this will drastically reduce transactions to Table Storage).
If using Windows Azure Caching is not an option, there are alternatives, like using MemoryCache, saving all keys in a file, ...
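The cache-first check described above can be sketched as follows. Both classes are illustrative stand-ins, not real SDK types: `CountingTable` plays the role of Table Storage and counts every call as one billable transaction, and `KeyTracker` uses a plain in-process set as the cache (closer to the MemoryCache alternative than to a distributed cache).

```python
class CountingTable:
    """Stand-in for Table Storage that counts billable operations."""

    def __init__(self, keys=()):
        self._keys = set(keys)
        self.transactions = 0

    def exists(self, key):
        self.transactions += 1
        return key in self._keys

    def add(self, key):
        self.transactions += 1
        self._keys.add(key)


class KeyTracker:
    """Checks the cache first and falls back to the table only on a miss."""

    def __init__(self, table, known_keys=()):
        self._table = table
        self._cache = set(known_keys)  # preload all existing keys

    def exists(self, key):
        if key in self._cache:
            return True  # cache hit: no storage transaction
        found = self._table.exists(key)  # cache miss: confirm against the table
        if found:
            self._cache.add(key)
        return found

    def add(self, key):
        self._table.add(key)
        self._cache.add(key)  # keep the cache in sync on every write
```

Lookups for keys that genuinely do not exist still cost one transaction each, since a miss has to be confirmed against the table; the savings come from the repeated checks on keys that were already processed.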

Azure table storage with large entity sizes

A couple of questions that I can't find any answers to. Hope someone can help:
I will be using entity sizes of almost 1MB. I can't find any information on read latency for these large entity sizes. Does anyone out there have any information on this?
Is there any way to determine how much space is used for a row in Azure table storage? Is there an API for this?
Thanks
If most of your entities are large, then it somewhat defeats the purpose of Table Storage in the first place. Indeed, you will only be able to retrieve or update them four at a time at most, as entity transactions are limited to 4MB.
You can get many useful measurements on the AzureScope project concerning the Table Storage, but also the other storage services of Azure.
Then, if you want to accurately check the weight of your rows, just use Fiddler to intercept your web requests, and directly look at the XML being produced.

Resources