I've set up an Azure batch process that reads multiple CSV files in parallel and writes to Azure DocumentDB. I need a suggestion on the consistency level that best fits my scenario.
I read through the consistency levels document (http://azure.microsoft.com/en-us/documentation/articles/documentdb-consistency-levels/) but am unable to relate my business case to the options described there.
My process:
Get the document by id.
- If found, pull a copy of the document, apply the changes, and replace it.
- If not found, create a new entry.
If your writes and reads are from the same process (or you can share an instance of the DocumentClient), then session consistency will give you the best performance while ensuring you get consistent reads. This is because the SDK manages the session tokens, ensuring that a read goes to a replica that has seen the write. Even if you don't do this, in your case the write will fail if you reuse the same document id: within a collection, document ids are guaranteed to be unique.
Short version - session consistency (the default) is probably a good choice.
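For illustration, here is a minimal sketch of that get-by-id, then replace-or-create flow, written against the current Python SDK (azure-cosmos) rather than the DocumentClient mentioned above; the endpoint, key, database, container, and partition key are all placeholder assumptions:

```python
from azure.cosmos import CosmosClient, exceptions

# Placeholder endpoint, key, database, and container names.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("items")

def upsert_row(doc_id: str, partition_key: str, changes: dict) -> None:
    try:
        # Get the document by id (a point read).
        doc = container.read_item(item=doc_id, partition_key=partition_key)
        doc.update(changes)                      # apply the changes from the CSV row
        container.replace_item(item=doc_id, body=doc)
    except exceptions.CosmosResourceNotFoundError:
        # Not found: create a new entry. create_item raises CosmosResourceExistsError (409)
        # if another writer created the same id concurrently.
        container.create_item(body={"id": doc_id, **changes})
```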
I created a container in Cosmos DB that tracks metadata about each API call (timestamp, user id, method name, duration, etc.). The partition key is set to UserId and each id is a random GUID. This container also helps me enforce per-user rate limiting. So far so good. Now I want to periodically clean up this container by moving records to an Azure Table (or something else) for long-term storage and reporting. Migrating records also helps me avoid the 20 GB logical partition size limit.
However, I have concerns about whether cross-partition queries will eventually bite me. Say I want to migrate all records that were created a week ago. Also, let's assume I have millions of active users, so this container sees a lot of activity and I can't specify a partition key in my query. I'm reading that we should avoid cross-partition queries when both RU/s and storage size are large. See this. I have no idea how many physical partitions I'm going to end up dealing with in the future.
Is my design completely off? How can I efficiently migrate records? I'm hoping that the CosmosDB team can see this and help me find a solution to this problem.
The easier approach would be to use a time-to-live (TTL) and write events/data to both Cosmos DB and Table Storage at the same time, so that it stays in Table Storage forever but is gone from Cosmos DB when the TTL expires. You can specify TTL at the document level, so if you need some documents to live longer, that can be done.
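A rough sketch of that dual-write-plus-TTL idea, assuming the Python SDKs (azure-cosmos and azure-data-tables), that TTL is enabled on the container, and that each event carries hypothetical id and userId fields:

```python
from azure.cosmos import CosmosClient
from azure.data.tables import TableClient

# Placeholder connection details.
cosmos = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
calls = cosmos.get_database_client("mydb").get_container_client("apicalls")
table = TableClient.from_connection_string("<storage-connection-string>", table_name="apicalls")

def record_api_call(event: dict) -> None:
    # Assumes 'event' carries hypothetical 'id' and 'userId' fields.
    # Per-document TTL in seconds; requires TTL to be enabled on the container.
    # This copy disappears from Cosmos DB after 7 days ...
    calls.create_item(body={**event, "ttl": 7 * 24 * 3600})
    # ... while the Table Storage copy stays forever for reporting.
    table.create_entity(entity={
        "PartitionKey": event["userId"],
        "RowKey": event["id"],
        **{k: v for k, v in event.items() if k not in ("userId", "id")},
    })
```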
Another approach might be using the change feed.
Based on your updated comments:
You are writing a Cosmos DB document for each API request.
When an API call is made, you query Cosmos DB for all API calls within a given time period, with the partition key being the userId. If the document count exceeds the threshold, you return an error such as HTTP 429.
You want to store API call information for long-term analysis.
If your API is getting a lot of use from a lot of users, Cosmos DB is going to be expensive to scale, both from a storage and a processing standpoint.
For rate limiting, consider a rate-limiting pattern built on a Redis cache. The StackExchange.Redis package is mature and has lots of guidance and code samples. It'll be a much lighter-weight and more scalable solution to your problem.
So for each API call, you would (a rough sketch follows this list):
Read the Redis key for the user making the call. Check to see if it exceeds your threshold.
Increment the user's Redis key.
Write the API invocation info to Azure Table Storage, probably with the PartitionKey being the userId and the RowKey being whatever makes sense for you.
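Here is that sketch of the three steps, using the Python redis client and azure-data-tables rather than StackExchange.Redis; the cache host, connection strings, threshold, and fixed one-minute window are illustrative assumptions, not part of the original answer:

```python
import time
import uuid
from datetime import datetime, timezone

import redis
from azure.data.tables import TableClient

# Placeholder cache host, key, and storage connection string.
r = redis.Redis(host="mycache.redis.cache.windows.net", port=6380, password="<key>", ssl=True)
audit = TableClient.from_connection_string("<storage-connection-string>", table_name="apicalls")

LIMIT = 100          # max calls per user per window (assumption)
WINDOW_SECONDS = 60  # fixed one-minute window (assumption)

def try_record_call(user_id: str, method: str) -> bool:
    window = int(time.time() // WINDOW_SECONDS)
    key = f"ratelimit:{user_id}:{window}"
    # Steps 1-2 (as increment-then-check): bump the user's counter for this
    # window and compare it against the threshold.
    count = r.incr(key)
    r.expire(key, WINDOW_SECONDS)            # the window key cleans itself up
    if count > LIMIT:
        return False                         # caller should respond with HTTP 429
    # Step 3: write the invocation info to Table Storage for long-term analysis.
    audit.create_entity(entity={
        "PartitionKey": user_id,
        "RowKey": str(uuid.uuid4()),
        "Method": method,
        "CalledAtUtc": datetime.now(timezone.utc),
    })
    return True
```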
I am trying a use case with Cosmos DB where we want to maintain one Cosmos DB account but split the data between a US region and a Europe region using some partition key.
And for inserting/updating documents, the application knows which region (US/Europe) the document belongs to, so is it possible to point to the right region while inserting/updating the document?
As far as I know, the Cosmos DB global distribution mechanism guarantees consistency across all replica sets.
When you create a globally distributed Cosmos DB account, you enable geo-redundancy.
You will see separate read and write regions.
Write operations are completed in the write region and replicated to the read regions to ensure consistency. On the client side, there is no need to point to a specific region to write data. From a consistency perspective, the data in all regions is supposed to be the same.
For more details, you could refer to this document.
Hope it helps you.
Can you have multiple write regions?
DocumentDB has good built-in features to bring read operations closer to consumers by adding read regions to your DocumentDB account. You can read about it in the documentation: "How to setup Azure Cosmos DB global distribution using the SQL API".
My understanding, based on this, is that there is always only one write region at any given time. I wouldn't bet on it, but it's hinted at in the documentation. For example, in "Connecting to a preferred region using the SQL API":
The SDK will automatically send all writes to the current write region.
All reads will be sent to the first available region in the PreferredLocations list. If the request fails, the client will fail down the list to the next region, and so on.
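To make that concrete, with the Python SDK the corresponding setting is (as far as I can tell) preferred_locations; the endpoint, key, and region names below are placeholders, and writes still go to the single write region regardless of this list:

```python
from azure.cosmos import CosmosClient

# Reads are served from the first reachable region in this list; writes always
# go to the account's current write region. Endpoint, key, and regions are placeholders.
client = CosmosClient(
    "https://myaccount.documents.azure.com:443/",
    credential="<key>",
    preferred_locations=["West Europe", "North Europe", "East US"],
)
```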
What you can do..
Things get more complicated when you also want to distribute writes (especially if you care about consistency and latency). DocumentDB's own documentation suggests you implement this as a combination of multiple accounts, each of which has its own local write region and automatic distribution to read/fallback nodes in other regions.
The downside is that your application would have to configure and read from all accounts in your application code and merge the results. Having data well partitioned by geography could help avoid a full fan-out at times, but your DAL would still have to manage multiple stores internally.
This scenario is explained in more detail in the documentation page "Multi-master globally replicated database architectures with Azure Cosmos DB".
I would seriously consider if adding such complexity would be justified, or if distributing just the reads would suffice.
All regions for a given account have the same replicated data. If you want to separate data across regions, you'd need to split it into two accounts.
Given partition A in the US and partition B in the EU, there is very little difference whether A and B are under the same account or under different accounts: the collection/db/account are all just logical wrappers on top of the partition.
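A small sketch of the two-account approach, assuming the application already knows each document's geography; the account endpoints, keys, and the region field are placeholders:

```python
from azure.cosmos import CosmosClient

# One account per geography, each with its own local write region
# (endpoints and keys are placeholders).
accounts = {
    "US": CosmosClient("https://myaccount-us.documents.azure.com:443/", credential="<us-key>"),
    "EU": CosmosClient("https://myaccount-eu.documents.azure.com:443/", credential="<eu-key>"),
}

def container_for(region: str):
    return accounts[region].get_database_client("mydb").get_container_client("docs")

def upsert_document(doc: dict) -> None:
    # The application already knows which geography the document belongs to
    # (a hypothetical 'region' field), so it picks the matching account; reads
    # that span both geographies must query both accounts and merge the results.
    container_for(doc["region"]).upsert_item(body=doc)
```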
So when a document is deleted, the metadata is actually preserved forever. For a hosted service like Cloudant, where storage costs money every month, I would instead like to completely purge the deleted documents.
I read somewhere about a design pattern where you use dbcopy in a view to put the docs into a 'current' db and then periodically delete the expired dbs. But I can't find the article, and I don't quite understand how database naming would work. How would the Cloudant clients always know the 'current' database name?
Cloudant does not expose the _purge endpoint (the loose consistency guarantees between the clustered nodes make purging tricky).
The most common solution to this problem is to create a second database and use replication with a validate_doc_update function so that deleted documents with no existing entry in the target database are rejected. When replication is complete (or acceptably up to date if using continuous replication), switch your application to use the new database and delete the old one. There is currently no way to rename databases, but you could use a virtual host which points to the "current" database.
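Sketching that replication-into-a-clean-database approach with plain HTTP calls from Python (requests); the account URL, credentials, and database names are placeholders, and depending on the account you may need full URLs with embedded credentials for the replication source/target:

```python
import requests

BASE = "https://myaccount.cloudant.com"   # placeholder account URL
AUTH = ("username", "password")           # placeholder credentials

# Design document in the *target* database: reject tombstones for documents
# the target has never seen, so old deletions are left behind.
validate_fn = """
function (newDoc, oldDoc, userCtx) {
  if (newDoc._deleted && !oldDoc) {
    throw({forbidden: 'deleted document not present in target'});
  }
}
"""

requests.put(f"{BASE}/mydb_v2", auth=AUTH)                        # create the new database
requests.put(f"{BASE}/mydb_v2/_design/no_tombstones", auth=AUTH,
             json={"validate_doc_update": validate_fn})

# One-shot replication from the old database into the new one.
requests.post(f"{BASE}/_replicate", auth=AUTH,
              json={"source": "mydb", "target": "mydb_v2"})
```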
I'd caution that a workload which generates a high ratio of deleted:active documents is generally an anti-pattern in Cloudant. I would first consider whether you can change your document model to avoid it.
Deleted documents are kept forever in CouchDB, even after compaction, though the size of a deleted document is pretty small as it contains only three fields:
{"_id": "234wer", "_rev": "123", "_deleted": true}
The reason for this is to make sure that all the replicated databases stay consistent: if a document that is replicated across several databases were simply erased from one location, there would be no way to tell the other replicas about the deletion.
There is _purge but as explained in the wiki it is only to be used in special cases.
I'm designing an application using Azure Storage Blobs/Tables/Queues, handling a massive amount of data.
One important aspect of the application is that work will be done only if a given key doesn't exist, and determining the existence of a key is a frequent and intensive task.
I need to minimize, as far as possible, the billable transactions generated by key-existence checks.
It could be either against blobs or tables.
I looked at the document "Understanding Windows Azure Storage Billing – Bandwidth, Transactions, and Capacity". It seems that 404 errors are only excluded from billing when they come from anonymous requests.
I was also thinking of using a TableBatchOperation to check 100 keys at once, maybe using a Replace or Merge, and determining from the results whether each key already existed (I haven't tried it; actually, I got the idea while writing this).
Any good hacks are welcome.
You should use Windows Azure Caching:
Load all existing keys in the cache
Each time you add a record to Table Storage, also add it to cache
Once you've done that, have your application check cache first. If the item is not present there, check Table Storage just to be sure (to cover edge cases). But 99% of the time, if the item has already been processed the key will be available in the cache and you won't need to query Table Storage (this will drastically reduce transactions to Table Storage).
If using Windows Azure Caching is not an option, there are alternatives, like using MemoryCache, saving all keys in a file, and so on.
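As a sketch of the cache-first check with one of those alternatives, here is an in-process set standing in for the cache, backed by azure-data-tables; the table name and key layout (the key used as both PartitionKey and RowKey) are assumptions:

```python
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient

# Placeholder connection string and table name; the key doubles as both
# PartitionKey and RowKey so a point lookup costs a single transaction.
table = TableClient.from_connection_string("<storage-connection-string>", table_name="keys")

# In-process stand-in for the shared cache, pre-loaded with the existing keys.
known_keys = {entity["RowKey"] for entity in table.list_entities(select=["RowKey"])}

def key_exists(key: str) -> bool:
    if key in known_keys:
        return True                                          # no billable transaction
    try:
        table.get_entity(partition_key=key, row_key=key)     # edge case: cache miss
        known_keys.add(key)
        return True
    except ResourceNotFoundError:
        return False

def add_key(key: str) -> None:
    table.create_entity(entity={"PartitionKey": key, "RowKey": key})
    known_keys.add(key)                                      # keep cache and table in step
```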
I need to create incremental reports in the table storage. I need to be able to update the same records from several different worker role instances (different roles with several instances each).
My reports consist mainly of values that I need to increment after I parse the raw data I initially stored.
The optimistic solution I found is to use a retry mechanism: try to update the record; if you get a 412 result code (you don't have the latest ETag value), retry. This solution becomes less efficient and more costly the more users you have and the more data you need to update simultaneously (my case exactly).
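For reference, that ETag retry loop looks roughly like this with the current azure-data-tables package (table, entity, and field names are placeholders); as far as I can tell, the 412 surfaces as ResourceModifiedError when the If-Match condition fails:

```python
from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError
from azure.data.tables import TableClient, UpdateMode

# Placeholder connection string and table name.
table = TableClient.from_connection_string("<storage-connection-string>", table_name="reports")

def increment_field(partition_key: str, row_key: str, field: str) -> None:
    while True:
        entity = table.get_entity(partition_key=partition_key, row_key=row_key)
        entity[field] = entity.get(field, 0) + 1
        try:
            # Conditional replace: rejected with 412 if another writer changed
            # the entity after we read it.
            table.update_entity(
                entity,
                mode=UpdateMode.REPLACE,
                etag=entity.metadata["etag"],
                match_condition=MatchConditions.IfNotModified,
            )
            return
        except ResourceModifiedError:
            continue  # lost the race; re-read and retry
```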
Another solution that comes to mind is to have only one instance of one worker role that can possibly update any given record. This is very problematic because this means that I will by-design create bottlenecks in my architecture, which is the opposite of the scale I want to reach with Azure.
If anyone here has some best practices in mind for such a use case, I would love to hear it.
Most cloud storage services (Table Storage is one of them) do not offer scalable writes to a single entity/blob/whatever. There is no quick fix for this limitation, as it comes from the core tradeoffs that were made to create cloud storage in the first place.
Basically, a given storage unit (entity/blob/whatever) can be updated about once every 20 ms, and that's about it. Having a dedicated worker or not will not change anything about this.
Instead, you need to approach your task from a different angle. For counters, the most usual approach is to use sharded counters (the link is for GAE, but you can implement equivalent behavior on Azure).
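A self-contained sketch of a sharded counter on Table Storage (the shard count, table name, and field names are assumptions): each logical counter is split across several shard rows, writers pick one at random, and readers sum the shards:

```python
import random

from azure.core import MatchConditions
from azure.core.exceptions import (ResourceExistsError, ResourceModifiedError,
                                   ResourceNotFoundError)
from azure.data.tables import TableClient, UpdateMode

# Placeholder connection string and table name.
table = TableClient.from_connection_string("<storage-connection-string>", table_name="counters")
SHARDS = 16   # assumption: size this to the write rate you need

def increment(counter_name: str) -> None:
    # Pick a random shard so no single entity receives every update; contention
    # (and 412 retries) per shard drops roughly by a factor of SHARDS.
    row_key = f"shard-{random.randrange(SHARDS)}"
    while True:
        try:
            entity = table.get_entity(partition_key=counter_name, row_key=row_key)
        except ResourceNotFoundError:
            try:
                table.create_entity(entity={"PartitionKey": counter_name,
                                            "RowKey": row_key, "Value": 1})
                return
            except ResourceExistsError:
                continue   # another writer created the shard first; retry
        entity["Value"] = entity.get("Value", 0) + 1
        try:
            table.update_entity(entity, mode=UpdateMode.REPLACE,
                                etag=entity.metadata["etag"],
                                match_condition=MatchConditions.IfNotModified)
            return
        except ResourceModifiedError:
            continue       # lost the optimistic-concurrency race; retry

def read_total(counter_name: str) -> int:
    # The counter's value is the sum of all its shard rows.
    shards = table.query_entities(f"PartitionKey eq '{counter_name}'")
    return sum(e.get("Value", 0) for e in shards)
```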
Also, another way to ease the pain is to go for an asynchronous architecture à la CQRS, where the performance constraints you put on the update latency of entities are significantly relaxed.
I believe the approach needs re-architecting. In order to ensure scalability and limit the amount of contention, you want to make sure that every write can work optimistically by providing a unique Table/PartitionKey/RowKey.
If you need those values merged together for reports, have a separate process/worker that post-aggregates/merges the records for reporting purposes. You can use a queue or a timer to start the aggregation/merging.