How to reduce reserved RUs to reduce cost of DocumentDB - azure

We are using DocumentDB on Azure. We have a single database with 7 collections, each holding at most 15 records. It does not require much storage.
Only a few developers are using this DB instance, so traffic is also very low.
Still, this server is using 67,600 RUs per day. There must be some problem with the DocumentDB settings, so I'm looking for direction on how to analyse exactly how these RUs are charged and how to reduce them.

There's no problem with your DocumentDB settings. You provisioned 7 collections. By default, via the portal, each collection is assigned 1000 RU (which you have at your disposal, regardless of whether you use 0 RU or all 1000 RU). The minimum RU setting for a non-partitioned collection is 400.
EDIT - I misread - if you're at 67,000 RU, then you have likely provisioned several partitioned collections (which start at 10,100 RU). For initial dev/test, with only 15 documents, you've grossly over-allocated capacity.
Since you provisioned seven collections (which are likely partitioned, based on your RU sizing), you have a ~70,000 RU deployment, regardless of what you actually consume (you're essentially reserving capacity).
I have no idea what your app needs are, or whether you need 7 collections for some specific reason. But objectively speaking, there is no rule that says you need to separate different document types into different collections. You can easily store heterogeneous data within a single collection. How you query for specific types is really up to you, but it's trivial to add something like a type property to each document.
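For illustration, here's a minimal sketch of the heterogeneous-collection approach using the current JavaScript/TypeScript SDK (@azure/cosmos); the database/container names and the type property values are placeholders, not anything from the original question:

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT!, key: process.env.COSMOS_KEY! });
const container = client.database("appdb").container("docs");

async function demo() {
  // Two different document types stored side by side, distinguished by a "type" property.
  await container.items.create({ id: "1", type: "customer", name: "Contoso" });
  await container.items.create({ id: "2", type: "invoice", customerId: "1", total: 42 });

  // Query only the invoices.
  const { resources: invoices } = await container.items
    .query("SELECT * FROM c WHERE c.type = 'invoice'")
    .fetchAll();
  console.log(invoices.length);
}

demo();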
Note, since I now believe you're using partitioned collections: You cannot convert these to non-partitioned collections; you'll need to create new non-partitioned collections and move your data from your partitioned collections. (given that you have 15 total documents, this should be trivial).
Note that a single non-partitioned collection may be scaled down to 400 RU. If you then combine your 7 collections into 1 collection, you should be able to reduce your provisioned throughput from ~70,000 RU to 400 RU (at least during dev/test).
EDIT: As of February 2017, the minimum RU for partitioned collections dropped to 2,500 (from the original 10,100 minimum). In December 2017, it dropped again, to 1,000.

It's common for people new to DocumentDB to think of a collection as similar to a table in SQL, or even to what MongoDB calls a "collection". However, DocumentDB is designed differently. It's best to use a single partitioned collection to store all document types and partition on something like geography, tenant, or user. You can distinguish document types with a type = <MyType> field; I actually prefer a myType = true approach, so I can model inheritance and mixins.
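As a rough sketch (the property names here are only illustrative), the two modeling styles look like this:

// Single "type" property per document:
const order = { id: "o1", type: "order", customerId: "c1", total: 12.5 };

// Boolean-flag style: a document can carry several type flags, which makes
// inheritance and mixins easy to express.
const customer = { id: "c1", isCustomer: true, isPremiumMember: true, name: "Jane" };

// Queries then filter on the field or flag, e.g.:
//   SELECT * FROM c WHERE c.type = 'order'
//   SELECT * FROM c WHERE c.isPremiumMember = true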
This means, you'll only need to pay for a single partitioned collection. A single partitioned collection may still end up costing you more than table storage, but if you want DocumentDB's near infinite scalability later on, then I highly recommend you start out the way I'm describing.
One more note about David's suggestion to go with non-partitioned collections. That was the only option when DocumentDB first launched, but it's now recommended to use partitioned collections, and I suspect the non-partitioned collection option may be phased out at some point. You interact with them slightly differently, and as David pointed out, there is currently no conversion assistance (especially if you use multiple non-partitioned collections), so transitioning later from non-partitioned collections to a partitioned collection is not hard, but it's not as simple as changing your partition type and will cost you development effort. A single partitioned collection will cost you a little more than a single non-partitioned collection, but IMHO it's worth it to save transition costs later, and it will cost you less than seven non-partitioned ones.

Related

Cosmos DB partition key and query design for sequential access

We would like to store a set of documents in Cosmos DB with a primary key of EventId. These records are evenly distributed across a number of customers. Clients need to access the latest records for a subset of customers as new documents are added. The documents are immutable, and need to be stored indefinitely.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
If we use just CustomerId as the partition key, we would eventually run over the 10GB limit for a logical partition, and if we use EventId, then querying becomes inefficient (would result in a cross-partition query, and high RU usage, which we'd like to avoid).
Another idea would be to group documents into blocks. i.e. PartitionKey = int(EventId / PartitionSize). This would result in all clients hitting the latest partition(s), which presumably would result in poor performance and throttling.
If we use a combined PartitionKey of CustomerId and int(EventId / PartitionSize), then it's not clear to me how we would avoid a cross-partition query to retrieve the correct set of documents.
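For concreteness, that combined key could be computed like this (PARTITION_SIZE and the key format here are just illustrative):

const PARTITION_SIZE = 10_000; // events per block; a tuning value, not fixed by our design

function partitionKeyFor(customerId: string, eventId: number): string {
  const block = Math.floor(eventId / PARTITION_SIZE);
  return `${customerId}-${block}`;
}

// partitionKeyFor("cust42", 123456) === "cust42-12"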
Edit:
Clarification of a couple of points:
Clients will access the events by specifying a list of CustomerIds, the last EventId they received, and a maximum number of records to retrieve.
For this reason, the use of EventId alone won't perform well, as it will result in a cross partition query (i.e. WHERE EventId > LastEventId).
The system will probably be writing on the order of 1GB a day, in 15 minute increments.
It's hard to know what the read volume will be, but I'd guess probably moderate, with maybe a few thousand clients polling the API at regular intervals.
First things first: the logical partition size limit has now been increased to 20GB, please see here.
You can use EventId as the partition key as well; there is a size limit per logical partition, but there is no limit on the number of logical partitions. So using EventId is fine, and you will get a point read, which is very fast, if you query using the EventId. You mention that this approach would force cross-partition queries; can you explain how?
A few things to keep in mind, though: Cosmos DB is not really meant for storing this kind of log data, since it stores everything on SSDs. Calculate your document size, how many documents you write per second, then per day, then per month. You can use TTL to delete data from Cosmos when you're done with it; for long-term storage, put it in Azure Blob Storage, and for fast retrieval use Azure Search to query the data in Blob storage using CustomerId and EventId in your search query.
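As an illustration of the point read mentioned above, a minimal sketch with the @azure/cosmos SDK (the database/container names are placeholders), assuming EventId is both the document id and the partition key:

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT!, key: process.env.COSMOS_KEY! });
const events = client.database("eventsdb").container("events");

async function readEvent(eventId: string) {
  // Point read: id and partition key are the same value, so this touches exactly
  // one logical partition and costs roughly 1 RU for a 1KB document.
  const { resource } = await events.item(eventId, eventId).read();
  return resource;
}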
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
I faced a similar issue some time back and a PartitionKey with customerId + datekey e.g. cust1_20200920 worked well for me.
I created the date key as 20200920 (YYYYMMDD), but you can choose to ignore the date part or even the month (cust1_202009 /cust1_2020), based on your query requirement.
Also, IMO, having multiple known PartitionKeys at query time is actually a good thing. For example, if you keep YYYYMM in the PartitionKey and want to get data for 4 months, you can run 4 queries in parallel and combine the results. This is faster when you have many clients and these partition keys are distributed among multiple physical partitions.
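A minimal sketch of that fan-out with the @azure/cosmos SDK, assuming a partition key of the form customerId_YYYYMM (the container name and key format are illustrative):

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT!, key: process.env.COSMOS_KEY! });
const container = client.database("eventsdb").container("events");

async function eventsForMonths(customerId: string, months: string[]) {
  // One single-partition query per month, issued in parallel.
  const results = await Promise.all(
    months.map(month =>
      container.items
        .query("SELECT * FROM c", { partitionKey: `${customerId}_${month}` })
        .fetchAll()
    )
  );
  return results.flatMap(r => r.resources);
}

// eventsForMonths("cust1", ["202006", "202007", "202008", "202009"]);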
On a separate note, Cosmos Db has recently introduced an analytical store for the transactional data which can be useful for your use case.
More about it here - https://learn.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction
One approach is using multiple Cosmos containers as "hot/cold" tiers with different partitioning. We could use two containers:
Recent: all writes and all queries for recent items go here. Partitioned by CustomerId.
Archive: all items are copied here for long term storage and access. Partitioned by CustomerId + timespan (e.g. partition per calendar month)
The Recent container would provide single partition queries by customer. Data growth per partition would be limited either by setting reasonable TTL during creation, or using a separate maintenance job (perhaps Azure Function on timer) to delete items when they are no longer candidates for recent-item queries.
A Change Feed processor, implemented by an Azure Function or otherwise, would trigger on each creation in Recent and make a copy into Archive. This copy would have partition key combining the customer ID and date range as appropriate to limit the partition size.
This scheme should provide efficient recent-item queries from Recent and safe long-term storage in Archive, with reasonable Archive query efficiency given a desired date range. The main downside is two writes for each item (one for each container) -- but that's the tradeoff for efficient polling. Whether this tradeoff is worthwhile is probably best determined by simulating the load and observing performance.
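For concreteness, a minimal sketch of the copy step, assuming it runs inside an Azure Functions Cosmos DB change-feed trigger that hands the function a batch of newly created Recent documents (the container names, the timestamp field, and the archive partition key shape are all illustrative):

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT!, key: process.env.COSMOS_KEY! });
const archive = client.database("eventsdb").container("Archive");

// Invoked with the batch of documents just written to the Recent container.
export async function copyToArchive(documents: any[]): Promise<void> {
  for (const doc of documents) {
    const month = new Date(doc.timestamp).toISOString().slice(0, 7); // e.g. "2020-09"
    await archive.items.upsert({
      ...doc,
      pk: `${doc.customerId}_${month}`, // CustomerId + calendar month
    });
  }
}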

Why should I not put all my data in one CosmosDB collection?

The problem
I have discovered that Cosmos DB pricing adds up quickly and can become expensive when used with many data types.
I would think that a good structure, would be to put each data type I have in their own collection, almost like tables in a database (not quite).
However, each collection costs at least 24 USD per month. That is if I choose "Fixed", which limits me to 10GB and is NOT scalable (hardly the point of Cosmos DB), so I would rather choose "Unlimited". However, there the price is at least 60 USD per month.
60 USD per month per data type.
This includes 1000 RU, but on top of this, I have to pay more for consumption.
This might be OK if I have a few data types, but if I have a fully fledged business application with 30 data types (not at all uncommon), it becomes 1800 USD per month, at least. As a starting price. When I have no data yet.
The question
The structure of the data in the collection is not strict. I can store different types of documents in the same collection.
When using an "Unlimited" collection, I can use partition keys, which should be used to partition my data to ensure scalability.
However, why do I not just include the data type in the partition key?
Then the partition key becomes something like:
[customer-id]-[data-type]-[actual-partition-value, like 'state']
With one swift move, my minimum cost becomes 60 USD and the rest is based on consumption. Presumably, partition keys ensure satisfactory performance regardless of the data volume. So what am I missing? Is there some problem with this approach?
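To make that concrete, a minimal sketch of building such a key (the /pk path and the property names are just illustrative):

// Hypothetical helper: build the composite partition key described above.
function partitionKey(customerId: string, dataType: string, partitionValue: string): string {
  return `${customerId}-${dataType}-${partitionValue}`;
}

const order = {
  id: "o-1001",
  pk: partitionKey("cust42", "order", "WA"), // "cust42-order-WA"
  type: "order",
  total: 99.95,
};
// The collection would be created with a partition key path of "/pk".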
Update
Microsoft now supports sharing RU across all containers (without a minimum of 10000 RU) so this question is essentially no longer relevant, as you can now freely choose to separate data into different containers without any extra cost.
No, there will be no problem per se.
It all boils down to whether you're fine with having 1000 RU/s, or more specifically a single bottleneck, for your whole system.
In fact, you can simplify this even more by making your document id the partition key. This guarantees the uniqueness of the document id and enables the maximum possible distribution and scale in CosmosDB.
That's exactly how collection sharing works in Cosmonaut (disclaimer, I'm the creator of this project) and I have noticed no problems, even on systems with many different data types.
However, you have to keep in mind that even though you can scale this collection up and down, you are still constraining your whole system to this one bottleneck. I would recommend that you don't just create one collection, but probably 2 or 3 collections with shared entities in them. If you do this smartly and group entities in a logical way, you can scale the throughput for specific parts of your system independently.
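A rough sketch of that layout with the @azure/cosmos SDK (the database and container names are placeholders; partitioning on /id follows the suggestion above):

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT!, key: process.env.COSMOS_KEY! });

async function setup() {
  const { database } = await client.databases.createIfNotExists({ id: "appdb" });

  // One shared container for most entity types, partitioned on the document id.
  await database.containers.createIfNotExists({ id: "shared", partitionKey: { paths: ["/id"] } });

  // A second shared container for a hot group of entities, so its throughput
  // can be scaled independently of the rest.
  await database.containers.createIfNotExists({ id: "shared-orders", partitionKey: { paths: ["/id"] } });
}

setup();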

Cosmos DB Graph Edge partitioning

Cosmos DB has pre-announced general availability of Gremlin (Graph API). Probably by the end of 2017 it will get out of preview, so we might consider it stable enough for production. That brings me to the following:
We are designing a system with an estimated user-base up to 100 million users. Each user will have some documents in Cosmos to store user-related data, those documents are partitioned on the id of the user (a Guid). So when estimations come true we will end up with at least 100 million partitions, each containing a bunch of documents.
Not only will we store user-related data but also interrelated data (relationships) between users. On paper Cosmos should be very well suited for these kinds of scenarios, utilizing it cross-api with Document API for normal data and Graph API purely for the relationships.
An example of one of these relationships is a Follow. For instance UserX can Follow UserY. To realize this relationship, we created a Gremlin query that creates an Edge:
g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}')
.addE('follow').to(g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}'))
The resulting Edge automatically gets assigned to the partition of UserX, because UserX is the out-vertex.
When querying on outgoing edges (all the users that UserX is following), all is fine and well because the query is limited to the partition for UserX.
g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}').outE('follow').inV()
However when inverting the query (find all followers of UserY), looking for incoming edges, the situation changes - to my knowledge this will result in a full cross-partition query:
g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}').inE('follow').outV()
In my opinion a full cross-partition query with 100 million partitions is unacceptable.
I have tried putting the Edge between UserX and UserY inside its own partition, but the Graph API does not let me do this. (Edit: Changed Cosmos to Graph API)
Now I have come to the point of implementing a pair of edges between UserX and UserY, one outgoing Edge for UserX and one outgoing Edge for UserY, trying to keep them in-sync. All this in order to optimize the speed of my queries, but also introducing more work to achieve eventual consistency.
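Roughly, the edge pair I have in mind looks like this (the 'followedBy' label and the UserRef shape are just placeholders; 'pkey' mirrors the queries above):

interface UserRef { id: string; partition: string; }

// Sketch: build the pair of edge-creation queries for an X-follows-Y relationship.
function buildFollowQueries(userX: UserRef, userY: UserRef): string[] {
  return [
    `g.V().hasId('${userX.id}').has('pkey','${userX.partition}')` +
      `.addE('follow').to(g.V().hasId('${userY.id}').has('pkey','${userY.partition}'))`,
    `g.V().hasId('${userY.id}').has('pkey','${userY.partition}')` +
      `.addE('followedBy').to(g.V().hasId('${userX.id}').has('pkey','${userX.partition}'))`,
  ];
}
// Both queries must succeed (with retry/repair on failure) to keep the pair eventually consistent.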
Then again, I am wondering if the Graph API is really up to these kinds of scenarios, or am I missing something here?
I will start by clearing up a slight misconception you have regarding CosmosDB partitioning. 100 million users doesn't mean 100 million physical partitions; it simply means 100 million partition keys. When you create a Cosmos DB graph, it starts with 10 physical partitions (this is the starting default, which can be changed upon request) and then scales automatically as data grows.
In this case, 100 million users will be distributed among 10 physical partitions, so a full cross-partition query will hit 10 physical partitions. Also note that these partitions are hit in parallel, so the expected latency should be similar to hitting one partition, unless the operation is aggregate-like in nature.
This is a classic partitioning dilemma, not unique to Cosmos/Graph.
If your usage pattern is lots of queries with small scope then cross-partition is bad. If it is returning large data sets then cross-partition overhead is probably insignificant against the benefits of parallelism. Unless you have a constant high volume of queries then I think the cross-partition overhead is overstated (MS seem to think everyone is building the next Facebook on Cosmos).
In the OP case you can optimise for x follows y, or x is followed by y, or both by having an edge each way. Note that RUs are reserved on a per partition basis (i.e. total RU / number of partitions) so to use them efficiently you need either high volume, evenly distributed, single partition queries or queries that span multiple partitions.

is it good to use different collections in a database in mongodb

I am going to do a project using Node.js and MongoDB. We are designing the database schema and are not sure whether we need to use different collections or the same collection to store the data, because each approach has its own pros and cons.
If we use a single collection, then whenever the database is invoked the whole collection will be loaded into memory, which reduces the available RAM. If we use different collections, then to retrieve data we need to write different queries. With one collection, retrieval is easy; with different collections, the application will become faster. We are confused about whether to use a single collection or multiple collections. Please guide me on which one is better.
Usually you use different collections for different things. For example, when you have users and articles in the system, you usually create a "users" collection for users and an "articles" collection for articles. You could create one collection called "objects" or something like that and put everything there, but that would mean adding some type field and using it for searches and storage of data. You can use a single collection in the database, but it would make usage more complicated. It would, of course, let you load the entire collection at once, but whether that is relevant for the performance of your application is something that would have to be profiled and tested for your particular use case.
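As a rough sketch with the Node.js MongoDB driver (the database and collection names are placeholders), the two layouts look like this:

import { MongoClient } from "mongodb";

async function main() {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const db = client.db("mydb");

  // Separate collections per thing:
  await db.collection("users").insertOne({ name: "Jane" });
  await db.collection("articles").insertOne({ title: "Hello", author: "Jane" });

  // A single "objects" collection with a discriminator field:
  await db.collection("objects").insertOne({ type: "article", title: "Hello" });
  const articles = await db.collection("objects").find({ type: "article" }).toArray();
  console.log(articles.length);

  await client.close();
}

main();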
Usually, developers create different collections for different things. For post management, for example, people create a 'posts' collection and save posts there, and do the same for users and so on.
Using a different collection for each purpose is good practice.
MongoDB is great at scaling horizontally. It can shard a collection across a dynamic cluster to produce a fast, queryable collection of your data.
So having a smaller collection size is not really a pro, and I am not sure where the theory that it is comes from; it isn't true in SQL and it isn't true in MongoDB. The performance of sharding, if done well, should be comparable to the performance of querying a single small collection of data (with a small overhead). If it isn't, then you have set up your sharding wrong.
MongoDB is not great at scaling vertically; as #Sushant quoted, the ns size of MongoDB would be a serious limitation here. One thing that quote does not mention is that index size and count also affect the ns size, hence why it says:
By default MongoDB has a limit of approximately 24,000 namespaces per database. Each namespace is 628 bytes, the .ns file is 16MB by default.
Each collection counts as a namespace, as does each index. Thus if every collection had one index, we can create up to 12,000 collections. The --nssize parameter allows you to increase this limit (see below).
Be aware that there is a certain minimum overhead per collection -- a few KB. Further, any index will require at least 8KB of data space as the b-tree page size is 8KB. Certain operations can get slow if there are a lot of collections and the meta data gets paged out.
So you won't be able to handle it gracefully if your users exceed the namespace limit. It also won't perform well as your user base grows.
UPDATE
For MongoDB 3.0 and above, using the WiredTiger storage engine, this is no longer a limit.
Yes personally I think having multiple collections in a DB keeps it nice and clean. The only thing I would worry about is the size of the collections. Collections are used by a lot of developers to cut up their db into, for example, posts, comments, users.

DocumentDB (via MongoDB protocol) collection size limit in azure

Without partitioning, there is a 10GB limit on each collection in Azure when using MongoDB (I used the MongoDB drivers on top of DocumentDB), and I have a collection whose size is 50GB.
Currently I have divided the data on the basis of a field and stored it in 6 different collections.
Should I be doing partitioning (I don't know how to do it), or is there a way to increase this size limit?
DocumentDB collection management really has nothing to do with MongoDB access protocol. Collections are either non-partitioned (10GB cap) or partitioned (250GB and beyond).
How you divide your data between collections is up to you. But keep these things in mind when deciding between multiple non-partitioned collections and a single partitioned collection:
The collection serves as a partition boundary, which includes stored procedures. If you need to work with content across collections, this could be an issue with your app, depending on its logic.
Non-partitioned collections have a Request Unit (RU) range of 400-10,000. Partitioned collections start at 2,500 RU (down from the original 10,100). Depending on your app budget, this could impact your collection decision.
You cannot convert a collection from non-partitioned <--> partitioned. If you decide to change the collection type, you'll need to create a new collection and move data between collections.
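Since the question mentions not knowing how to do the partitioning: a minimal sketch of creating a partitioned collection, using the current @azure/cosmos SDK rather than the SDK of the question's era (the names, the partition key path, and the 2,500 RU figure are illustrative):

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT!, key: process.env.COSMOS_KEY! });

async function createPartitionedCollection() {
  const { database } = await client.databases.createIfNotExists({ id: "mydb" });
  // A partitioned collection is simply one created with a partition key definition.
  await database.containers.createIfNotExists(
    { id: "events", partitionKey: { paths: ["/customerId"] } },
    { offerThroughput: 2500 }
  );
}

createPartitionedCollection();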
