How to optimize transaction costs when testing the existence of keys? - azure

I'm designing an application using Azure Storage Blobs/Tables/Queues, handling massive amounts of data.
One important aspect of the application is that work will only be done if a given key doesn't exist, and determining the existence of a key is a frequent and intensive task.
I need to reduce, as much as possible, the billable transactions generated by these key-existence checks.
It could be either against blobs or tables.
I looked at this document: Understanding Windows Azure Storage Billing – Bandwidth, Transactions, and Capacity. It seems that 404 errors are only excluded from billing when they come from anonymous requests.
I was also thinking of using a BatchTableOperation to check 100 keys at once, maybe using a Replace or Merge, and determining from the results whether the key indeed existed (I haven't tried it; actually I got the idea while writing).
Any good hacks are welcome.

You should use Windows Azure Caching:
Load all existing keys in the cache
Each time you add a record to Table Storage, also add it to cache
Once you've done that, have your application check cache first. If the item is not present there, check Table Storage just to be sure (to cover edge cases). But 99% of the time, if the item has already been processed the key will be available in the cache and you won't need to query Table Storage (this will drastically reduce transactions to Table Storage).
If using Windows Azure Caching is not an option, there are alternatives, like using MemoryCache or saving all keys in a file. A rough sketch of the cache-first check follows.
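As an illustration only, here is a minimal sketch of that cache-first existence check, assuming MemoryCache as the local cache and the classic WindowsAzure.Storage table SDK; the class name, table layout, and one-hour expiry are made-up examples, not a definitive implementation:

```csharp
// Sketch only: cache-first existence check (names and table layout are illustrative).
// Requires the WindowsAzure.Storage package and System.Runtime.Caching.
using System;
using System.Runtime.Caching;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class KeyExistenceChecker
{
    private readonly CloudTable _table;
    private readonly MemoryCache _cache = MemoryCache.Default;

    public KeyExistenceChecker(string connectionString, string tableName)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        _table = account.CreateCloudTableClient().GetTableReference(tableName);
    }

    public async Task<bool> ExistsAsync(string partitionKey, string rowKey)
    {
        string cacheKey = partitionKey + "|" + rowKey;

        // 1. Cheap local check first: no storage transaction is billed.
        if (_cache.Contains(cacheKey))
            return true;

        // 2. Fall back to Table Storage to cover cache misses / edge cases.
        var result = await _table.ExecuteAsync(
            TableOperation.Retrieve<DynamicTableEntity>(partitionKey, rowKey));

        bool exists = result.Result != null;
        if (exists)
            _cache.Add(cacheKey, true, DateTimeOffset.UtcNow.AddHours(1));

        return exists;
    }
}
```

In a multi-instance deployment you would swap MemoryCache for the shared cache so all role instances see newly added keys.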

Related

CosmosDB: Efficiently migrate records from a large container

I created a container in CosmosDB that tracks metadata about each API call (timestamp, user id, method name, duration, etc.). The partition key is set to UserId and each id is a random Guid. This container also helps me enforce rate limiting for each user. So far so good. Now, I want to periodically clean up this container by moving records to an Azure Table (or something else) for long-term storage and generate reporting. Migrating records also helps me avoid the 20GB logical partition size limit.
However, I have concerns about whether cross-partition queries will bite me eventually. Say, I want to migrate all records that were created a week ago. Also, let's assume I have millions of active users. Thus, this container sees a lot of activity and I can't specify a partition key in my query. I'm reading that we should avoid cross-partition queries when RU/s and storage size are both big. See this. I have no idea how many physical partitions I'm going to end up dealing with in the future.
Is my design completely off? How can I efficiently migrate records? I'm hoping that the CosmosDB team can see this and help me find a solution to this problem.
The easier approach would be to use a time-to-live (TTL) and just write events/data to both Cosmos DB and Table Storage at the same time, so that the data stays in Table Storage forever but is gone from Cosmos DB when the TTL expires. You can specify the TTL at the document level, so if you need some documents to live longer, that can be done.
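For illustration, here is a minimal sketch of per-document TTL with the Microsoft.Azure.Cosmos (v3) SDK; the database, container, and property names are assumptions, and the Table Storage write is only indicated in a comment:

```csharp
// Sketch only: per-document TTL in Cosmos DB (Microsoft.Azure.Cosmos SDK v3).
// Names (database/container/properties) are illustrative.
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json;

public class ApiCallRecord
{
    [JsonProperty("id")]     public string Id { get; set; }
    [JsonProperty("userId")] public string UserId { get; set; }
    [JsonProperty("method")] public string Method { get; set; }
    [JsonProperty("ttl")]    public int? TtlSeconds { get; set; } // per-document TTL
}

public static class TtlSample
{
    public static async Task WriteAsync(CosmosClient client, ApiCallRecord record)
    {
        Database db = await client.CreateDatabaseIfNotExistsAsync("telemetry");

        // DefaultTimeToLive = -1 turns TTL on for the container but lets each
        // document decide its own expiry via the "ttl" property.
        Container container = await db.CreateContainerIfNotExistsAsync(
            new ContainerProperties("apicalls", "/userId") { DefaultTimeToLive = -1 });

        record.TtlSeconds = 7 * 24 * 60 * 60; // expire after one week
        await container.CreateItemAsync(record, new PartitionKey(record.UserId));

        // In parallel you would also write the same record to Table Storage
        // (not shown) so it is retained for long-term reporting.
    }
}
```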
Another approach might be using the change feed.
Based on your updated comments:
You are writing a CosmosDb doc for each API request.
When an API call is made, you are querying Cosmos DB for all API calls within a given time period, with the partition being the userId. If the document count exceeds the threshold, you return an error such as an HTTP 429.
You want to store API call information for long-term analysis.
If your API is getting a lot of use from a lot of users, using CosmosDB is going to be expensive to scale, both from a storage and a processing standpoint.
For rate limiting, consider this rate limiting pattern using Redis cache. The StackExchange.Redis package is mature, and has lots of guidance and code samples. It'll be a much lighter weight and scalable solution to your problem.
So for each API call, you would:
Read the Redis key for the user making the call. Check to see if it exceeds your threshold.
Increment the user's Redis key.
Write the API invocation info to Azure Table Storage, probably with the partition key being the userId, and the row key being whatever makes sense for you (a rough sketch follows these steps).
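As a sketch of the rate-limiting part with StackExchange.Redis (the key naming, window length, and limit are my own illustrative assumptions, not a prescribed design):

```csharp
// Sketch only: fixed-window rate limiting with StackExchange.Redis.
// Key naming, window length, and per-minute limit are illustrative assumptions.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class RedisRateLimiter
{
    private readonly IDatabase _redis;
    private readonly int _limitPerMinute;

    public RedisRateLimiter(IConnectionMultiplexer connection, int limitPerMinute = 100)
    {
        _redis = connection.GetDatabase();
        _limitPerMinute = limitPerMinute;
    }

    // Returns false when the caller should receive HTTP 429.
    public async Task<bool> TryRecordCallAsync(string userId)
    {
        // One counter per user per minute, e.g. "rate:user123:202401011230".
        string key = $"rate:{userId}:{DateTime.UtcNow:yyyyMMddHHmm}";

        long count = await _redis.StringIncrementAsync(key);
        if (count == 1)
        {
            // First hit in this window: let the counter clean itself up.
            await _redis.KeyExpireAsync(key, TimeSpan.FromMinutes(2));
        }

        return count <= _limitPerMinute;
        // After this check, the API call details would be appended to
        // Azure Table Storage (PartitionKey = userId) for long-term analysis.
    }
}
```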

Azure: Redis vs Table Storage for web cache

We currently use Redis as our persistent cache for our web application, but with its limited memory and cost I'm starting to consider whether Table storage is a viable option.
The data we store is fairly basic json data with a clear 2 part key which we'd use for the partition and row key in table storage so I'm hoping that would mean fast querying.
I appreciate that one is in memory and one is not, so Table storage will be a bit slower, but as we scale I believe there is only one CPU serving data from a Redis cache, whereas with Table storage we wouldn't have that issue, as it would come down to the number of web servers we have running.
Does anyone have any experience of using Table storage in this way, or comparisons between the two?
I should add we use Redis in a very minimalist way: get/set and nothing more. We evict our own data and, failing that, leave the eviction to Redis when it runs out of space.
This is a fairly broad/opinion-soliciting question. But from an objective perspective, these are the attributes you'll want to consider when deciding which to use:
Table Storage is a durable, key/value store. As such, content doesn't expire. You'll be responsible for clearing out data.
Table Storage scales to 500TB.
Redis is scalable horizontally across multiple nodes (or, scalable via Redis Service). In contrast, Table Storage will provide up to 2,000 transactions / sec on a partition, 20,000 transactions / sec across the storage account, and to scale beyond, you'd need to utilize multiple storage accounts.
Table Storage will have a significantly lower cost footprint than a VM or Redis service.
Redis provides features beyond Azure Storage tables (such as pub/sub, content eviction, etc).
Both Table Storage and Redis Cache are accessible via an endpoint, with many language-specific SDK wrappers around the APIs.
I found some materials about Azure Redis and Table storage; I hope they can help you. There is a video about Azure Redis that also includes a demo comparing Table storage and Redis, starting at about the 50th minute of the video.
Perhaps it can serve as a reference, but detailed performance depends on your application, data records, and so on.
The pricing of Table storage depends on its capacity; please refer to the pricing details. It is much cheaper than Redis.
There are many differences you might care about, including price, performance, feature set, data persistence, and data consistency.
Because Redis is an in-memory data store, it is pretty expensive; that is the price you pay for low latency. Check out Azure's Redis planning FAQ for a general understanding of Redis performance in a throughput sense.
Azure Redis planning FAQ
Redis does have an optional persistence feature, that you can turn on, if you want your data persisted and restored when the servers have rare downtime. But it doesn't have a strong consistency guarantee.
Azure Table Storage is not a caching solution. It is a persistent storage solution that saves the data permanently on some kind of disk. Historically (disclaimer: I have not looked for the latest and greatest performance numbers) it has much higher read and write latency. It is also strictly a key-value store model (with two-part keys). Values can have properties, but with many strict limitations around the size of objects you can store, the length of properties, and so on. These limitations are inflexible and painful if your app runs up against them.
Redis has a larger feature set. It can do key-value but also has a bunch of other data structures like sets and lists, and many apps can find ways to benefit from that added flexibility.
See 'Introduction to Redis' (redis docs) .
CosmosDB could be yet another alternative to consider if you're leaning primarily towards Azure technologies. It is pretty expensive, but quite fast and feature-rich, while also being primarily intended as a persistent store.

Possible to have the partition key added server side in Azure Table Storage?

I noticed that Windows Azure Diagnostics uses a UTC ticks primary key as a method of making it easy to access entries by time ranges. I would like to implement a similar system for my table.
However a major issue is that the systems that will be doing the uploading will not necessarily have their time synced to the millisecond (not to mention ping time differences) so setting the Partition key locally and then uploading doesn't work well (I am having all kinds of race condition issues). Ideally I would like to guarantee that any time a table entry is made its partition key is certain to be greater than or equal to any partition key already in the table (since that's how time works).
The only way that I can think to ensure this guarantee is by having the "timestamp" partition key set server side. Is there some way to have this happen, such as via a server side script?
Note: I realize that a timestamp is added already for when an entry is made, but tables are not indexed by this timestamp.
I would recommend using the Twitter Snowflake approach for that purpose. I had a very similar requirement and that approach worked perfectly for me.
I used Flake ID Generator. It is a .NET implementation based on Twitter Snowflake.
The generator can be independently deployed to different Azure instances (or run as an independent service); I was generating ids independently on each Azure service instance. The generated 64-bit ids are directly sortable and always unique (even if they come from different instances at the same time). You also have access to its source code, so you can add customizations if needed. A simplified sketch of the idea is below.
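To make the idea concrete, here is a simplified Snowflake-style generator; it is a sketch of the general technique, not the Flake ID Generator library itself (whose details differ), and it omits hardening such as clock-rollback protection:

```csharp
// Sketch only: simplified Snowflake-style ID generation.
// 41 bits of time | 10 bits of worker id | 12 bits of sequence.
using System;

public class SnowflakeIdGenerator
{
    private static readonly DateTime Epoch = new DateTime(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc);
    private readonly long _workerId;          // unique per role instance (0-1023)
    private readonly object _lock = new object();
    private long _lastTimestamp = -1;
    private long _sequence;

    public SnowflakeIdGenerator(long workerId) => _workerId = workerId & 0x3FF;

    public long NextId()
    {
        lock (_lock)
        {
            long timestamp = (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;

            if (timestamp == _lastTimestamp)
            {
                // Same millisecond: bump the 12-bit sequence; spin to the next ms if exhausted.
                _sequence = (_sequence + 1) & 0xFFF;
                if (_sequence == 0)
                    while (timestamp <= _lastTimestamp)
                        timestamp = (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
            }
            else
            {
                _sequence = 0;
            }

            _lastTimestamp = timestamp;

            // Time-ordered across instances, unique within one instance.
            return (timestamp << 22) | (_workerId << 12) | _sequence;
        }
    }
}
```

Formatted with fixed width (for example `NextId().ToString("D19")`), such an id sorts chronologically and can be used directly as a PartitionKey or RowKey.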
I hope that will help.

Table Storage Service (Azure's implementation of NoSQL) vs Windows Azure Caching (unstructured in-memory cache)

We want to implement caching in Azure for two main reasons:
Speed up repetitive data access
Reduce stress on the database
Here are the characteristics of the data we are planning to cache:
Relatively small (1 - 100 kb)
Specific to each customer
Not private, but we don't really want random people navigating through our entire cache
XML or JSON
Consumed by C# (i.e. not linked to directly in the html)
Most weeks the data will not change, although some days the data could change several times
For this specific purpose Table storage appears better than Blob storage (we did just implement Blob storage for images, CSS, and JavaScript) and Windows Azure Caching appears better than Windows Azure Shared Cache (perhaps almost always better and the shared caching is mostly a legacy feature at this point).
The programming API of both appears straight forward. Compared to what we pay for cloud sites the cost of each seems to be negligible.
So far we are leaning toward Table Storage due to what we perceive to be the pros and cons of Azure Caching. As old .Net guys we are much more familiar with In-Memory Cache than NoSql style solutions:
Problems with Windows Azure Caching:
If the VM is moved to a different server (by Microsoft for load balancing or whatever reasons) is the in-memory cache moved intact?
We are guessing that whenever we publish changes to the cloud it wipes out the existing in-memory cache
While the users rarely make changes to the cached data, when they do make changes it is likely that they may make multiple updates within seconds, and we are not sure how this is going to work with a cache located across multiple nodes running web roles, especially with increased traffic. (This is probably a concern with Table storage as well!)
Table storage appears like it will be easier to debug
Advantages of Windows Azure Caching
somewhat faster
Your familiarity with in-memory caching is the model that you need to understand to implement caching on Windows Azure. The 'NoSql style' is not caching, but storage. So table storage rather replaces SQL than it replaces caching. Table storage is for persistent, reliable storage, with all of the latency and other disadvantages of persistence that do not exist with an in-memory cache.
Writing to cache is always secondary. When your users 'make changes to the cached data' you will always be writing out the data to disk (e.g. SQL), and then writing out the same data to the cache because you might as well, since you have the data on-hand (although secondary effects on written data may mean that you should invalidate or re-read the cached item).
The wiping out of data when a machine recycles should not be much of a concern, as the data is stored elsewhere. Every read from the cache should be followed by an 'if not found then read from database' kind of statement. You can warm-up the cache when a role starts to pre-populate items that you know that you are going to need.
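The read-through and write behaviour described above is the classic cache-aside pattern. As a rough sketch (using MemoryCache to stand in for whichever cache you pick, and with LoadFromDatabaseAsync/SaveToDatabaseAsync as hypothetical placeholders for your system of record):

```csharp
// Sketch only: cache-aside pattern. MemoryCache stands in for the real cache;
// LoadFromDatabaseAsync/SaveToDatabaseAsync are placeholders, not real APIs.
using System;
using System.Runtime.Caching;
using System.Threading.Tasks;

public class CustomerDataCache
{
    private readonly MemoryCache _cache = MemoryCache.Default;

    public async Task<string> GetCustomerXmlAsync(string customerId)
    {
        // 1. Try the cache first.
        if (_cache.Get(customerId) is string cached)
            return cached;

        // 2. Not found: fall back to the system of record (SQL, Table Storage, ...).
        string data = await LoadFromDatabaseAsync(customerId);

        // 3. Populate the cache so the next read is cheap.
        _cache.Set(customerId, data, DateTimeOffset.UtcNow.AddHours(4));
        return data;
    }

    public async Task SaveCustomerXmlAsync(string customerId, string data)
    {
        // Writes always go to durable storage first; the cache is updated
        // (or invalidated) afterwards because the data is already on hand.
        await SaveToDatabaseAsync(customerId, data);
        _cache.Set(customerId, data, DateTimeOffset.UtcNow.AddHours(4));
    }

    private Task<string> LoadFromDatabaseAsync(string customerId) =>
        Task.FromResult("<customer id=\"" + customerId + "\" />"); // placeholder

    private Task SaveToDatabaseAsync(string customerId, string data) =>
        Task.CompletedTask; // placeholder
}
```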
Caching on Azure is distributed across the nodes, and updating an existing item will always update it on the node where it resides. Quick updates may be less of a problem than you think.
For in-memory caching use Windows Azure Caching (you are right about shared caching being legacy) and, depending on your needs, look at other caching technologies like memcached. Caching and table storage are not comparable. Table storage is for long-term persistence. Don't unnecessarily hack table storage to do caching: making table storage temporary creates a whole bunch of things that you need to worry about yourself, like expiry and invalidation.

Creating incremental reports using Azure Tables

I need to create incremental reports in the table storage. I need to be able to update the same records from several different worker role instances (different roles with several instances each).
My reports consist mainly of values that I need to increment after I parse the raw data I initially stored.
The optimistic solution I found is to use a retry mechanism: try to update the record, and if you get a 412 result code (you don't have the latest ETag value), retry. This solution becomes less efficient and more costly the more users you have and the more data you need to update simultaneously (my case exactly).
Another solution that comes to mind is to have only one instance of one worker role that can possibly update any given record. This is very problematic because this means that I will by-design create bottlenecks in my architecture, which is the opposite of the scale I want to reach with Azure.
If anyone here has some best practices in mind for such a use case, I would love to hear it.
Most cloud storage services (Table Storage is one of them) do not offer scalable writes on a single entity/blob/whatever. There is no quick fix for this limitation, as it comes from the core tradeoffs that were made to create cloud storage in the first place.
Basically, a storage unit (entity/blob/whatever) can be updated about once every 20ms, and that's about it. Having a dedicated worker or not will not change anything to this aspect.
Instead, you need to address your task from a different angle. For counters, the most usual approach is the use of sharded counters (the link is for GAE, but you can implement equivalent behavior on Azure), as sketched below.
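A minimal sketch of a sharded counter on Azure Table Storage follows. Note that Table Storage has no server-side increment, so each shard still uses an ETag-checked read-modify-write; the gain is that contention is spread over many rows. The entity shape, shard count, and retry handling are illustrative assumptions:

```csharp
// Sketch only: sharded counter on Azure Table Storage (WindowsAzure.Storage SDK).
// Shard count, entity shape, and retry policy are illustrative.
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class CounterShard : TableEntity
{
    public long Value { get; set; }
}

public class ShardedCounter
{
    private const int ShardCount = 16;                 // spreads contention across 16 rows
    private static readonly Random Rng = new Random();
    private readonly CloudTable _table;
    private readonly string _counterName;

    public ShardedCounter(CloudTable table, string counterName)
    {
        _table = table;
        _counterName = counterName;
    }

    public async Task IncrementAsync(long delta = 1)
    {
        // Pick a random shard so concurrent writers rarely touch the same row.
        string rowKey = "shard-" + Rng.Next(ShardCount).ToString("D2");

        while (true)
        {
            var retrieve = await _table.ExecuteAsync(
                TableOperation.Retrieve<CounterShard>(_counterName, rowKey));
            var shard = (CounterShard)retrieve.Result
                        ?? new CounterShard { PartitionKey = _counterName, RowKey = rowKey };

            shard.Value += delta;
            try
            {
                // Insert for brand-new shards; ETag-checked Replace for existing ones.
                await _table.ExecuteAsync(retrieve.Result == null
                    ? TableOperation.Insert(shard)
                    : TableOperation.Replace(shard));
                return;
            }
            catch (StorageException e) when (e.RequestInformation.HttpStatusCode == 412 ||
                                             e.RequestInformation.HttpStatusCode == 409)
            {
                // Lost a race on this shard: retry (still far less contention than one row).
            }
        }
    }

    public async Task<long> GetTotalAsync()
    {
        // The counter's value is the sum of all shards.
        var query = new TableQuery<CounterShard>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, _counterName));

        long total = 0;
        TableContinuationToken token = null;
        do
        {
            var segment = await _table.ExecuteQuerySegmentedAsync(query, token);
            total += segment.Results.Sum(s => s.Value);
            token = segment.ContinuationToken;
        } while (token != null);
        return total;
    }
}
```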
Also, another way to ease the pain is to go for an asynchronous architecture à la CQRS, where the performance constraints you put on the update latency of entities are significantly relaxed.
I believe the approach needs re-architecting. In order to ensure scalability and limit the amount of contention, you want to make sure that every write can work optimistically, by providing a unique Table/PartitionKey/RowKey.
If you need those values to be merged together for reports, have a separate process/worker that will post-aggregate/merge the records for reporting purposes. You can use a queue or a timing mechanism to start the aggregation/merging.