I have a collection with ~8000 documents that I paginate in my application. However, my query to get the total document count (to calculate the page count) is blowing my RU/s quota out of the water.
The find query itself only costs about 3 RUs, but takes a while to execute...
Is there a solution to this?
db.orders.count({"user": ObjectId("5ca51dc1234c0b21dcxxa12c")})
Operation consumed 442.62 RUs
5958
It will search all the partitions in your container. Pass the partition key with your query in order to narrow the search; otherwise it has to go through every partition in the container, which results in a lot of RUs.
If the find query takes a lot of time, make sure that you have an index on this field. Currently, indexes seem to be added automatically for every field, so that shouldn't be an issue.
However, from my own recent experience with Cosmos DB, if count returns a large number, it will also consume a lot of RUs. See: https://stackoverflow.com/a/60512604/4619705 for more info.
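As a minimal sketch of that advice with pymongo, assuming a hypothetical shard key field, placeholder connection details, and a page size of 50 (none of which come from the question):

# Hedged sketch: connection string, database, and the shard key field are placeholders.
from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("<cosmos-mongo-connection-string>")
orders = client["mydb"]["orders"]

def count_user_orders(shard_key_value, user_id_hex, page_size=50):
    """Count a user's orders while scoping the query to a single partition."""
    total = orders.count_documents({
        "shardKey": shard_key_value,       # hypothetical partition key field
        "user": ObjectId(user_id_hex),
    })
    pages = -(-total // page_size)         # ceiling division for the page count
    return total, pages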
Related
We have a Cosmos DB container which has about 1M records containing information about customers. The partition key for the container is customerId, which holds a unique GUID reference for the customer. I have read the partitioning and scaling documentation, which suggests that our choice of key is appropriate; however, if we want to query this data using a field such as DOB or address, the query will be considered a cross-partition query and will essentially send the same query to every record in the container before returning.
The query stats in Data Explorer suggest that a query on customer address will return the first 200 documents at a cost of 36.9 RUs, but I was under the impression that this would be far higher given the number of records the query would be sent to. Are these query stats accurate?
It is likely that we will want to extend our app to query on multiple non-partition data elements, so are we best off replicating the customer identity and the searchable data element in another container, using the desired searchable data element as the partition key? We could then return the identities of all customers who match the query. This essentially changes the query to an in-partition query and should prevent the additional expenditure?
Our current production database has 4,000 RU/s (max throughput, shared), so there appears to be adequate provision for cross-partition queries. Would I be wasting my time building out a change feed to maintain a partitioned representation of the data to support in-partition queries instead of cross-partition queries?
To get an accurate estimate of query cost you need to take the measurement on a container that has a realistic amount of data in it. For example, if I have a container with 5,000 RU/s and 5 GB of data, my cross-partition query will be fairly inexpensive because it only ran on a single physical partition.
If I ran that same query on a container with 100,000 RU/s, I would have more than 10 physical partitions, and the query would report a much greater RU charge because it has to execute across all of those physical partitions. (Note: one physical partition supports a maximum of 10,000 RU/s or 50 GB of storage.)
It is impossible to say at what amount of RU/s and storage you will begin to see a more realistic number for RU charges. I also don't know how much throughput or storage you need. If the workload is small then maybe you only need 10K RU/s and less than 50 GB of storage. It is only when you need to scale out that you should first scale out and then measure your query's RU charge.
In short, to get accurate query measurements you need a container with the throughput and amount of data you would expect to have in production.
You don't necessarily need to be afraid of cross-partition queries in Cosmos DB. Yes, single-partition queries are faster, but if you need to answer "find any customers matching X" then a cross-partition query is naturally required (unless you really want the hassle of duplicating the info elsewhere in an optimized form).
The cross-partition query will not be sent to "each document" as long as you have good indexes within the partitions. Just make sure every query has a predicate on a field that is:
indexed
with good-enough data cardinality
... and make sure the number of returned docs is limited by the business model or forced (top N). That way your RU charge should be more or less bounded.
36 RUs per 200 returned docs does not sound too bad as long as the query isn't run too many times per second. But if in doubt, test with the predicted data volume and fire up some realistic queries.
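As a rough way to validate those query stats yourself, here is a minimal sketch using the azure-cosmos Python SDK that runs a cross-partition query and reads back the request charge; the endpoint, key, container name, and the address field are assumptions rather than details from the question:

# Hedged sketch: endpoint, key, names, and the queried field are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient("<account-endpoint>", credential="<account-key>")
container = client.get_database_client("mydb").get_container_client("customers")

# Cross-partition query on a non-partition-key field (address is assumed here).
results = container.query_items(
    query="SELECT TOP 200 * FROM c WHERE c.address.city = @city",
    parameters=[{"name": "@city", "value": "Leeds"}],
    enable_cross_partition_query=True,
)
docs = list(results)

# The request charge for the most recent request is exposed via a response header.
charge = container.client_connection.last_response_headers.get("x-ms-request-charge")
print(f"{len(docs)} docs, {charge} RUs")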
I have migrated all my databases to Cosmos DB with the Mongo API after removing their indices. After the migration, I started creating indices manually on Cosmos DB. I have a collection called order. It has 7 million documents, each nearly 1 KB in size. But when I update the index, it takes a lot of time. I am checking the index update status; it has been 30 minutes and the update is only 40% complete. Is this index update an RU-consuming operation? I know we have a limitation of 5,000 RU/s per container, so is this slowness because of that? If someone knows the answer to this, please help me. Also, will Azure charge me for the RUs that I consume during an index update? I have read somewhere that it won't.
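For reference, a minimal pymongo sketch of creating an index manually through the Mongo API; the connection string, database, and field name are placeholders rather than details from the post:

# Hedged sketch: connection string, database, and field names are placeholders.
from pymongo import ASCENDING, MongoClient

client = MongoClient("<cosmos-mongo-connection-string>")
orders = client["mydb"]["order"]

# Create a single-field index; the server then rebuilds the index over existing documents.
orders.create_index([("customerId", ASCENDING)], name="customerId_1")

# List the collection's indexes to confirm the definition was accepted.
for idx in orders.list_indexes():
    print(idx)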
I'm a beginner with Azure. I'm using log monitoring to view the logs for a Cosmos DB resource, and I can see one log entry with a Replace operation which is consuming a lot of average RUs.
Generally operation names should be CREATE/DELETE/UPDATE/READ, so why does a REPLACE operation appear here? I could not understand it. And why is the REPLACE operation consuming a lot of RUs?
What can I try next?
Updates in Cosmos DB are full replacement operations rather than in-place updates, so they consume more RUs than inserts. Also, the larger the document, the more throughput the update requires.
Strategies to reduce throughput consumption on update operations typically center around splitting documents in two: the properties that don't change go into one document, which is typically larger, and the frequently changing properties go into another, smaller document. Updates are then made against the smaller document, which reduces the RUs consumed by the operation (a sketch of this split follows below).
All that said, 12 RUs is not an inordinate charge for a replace operation. I don't think you will get much, if any, throughput reduction by doing this, but you can certainly try.
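A minimal sketch of that split with the azure-cosmos Python SDK; the document shapes, ids, and container name are hypothetical and not taken from the question:

# Hedged sketch: document shapes, ids, and container names are hypothetical.
from azure.cosmos import CosmosClient

client = CosmosClient("<account-endpoint>", credential="<account-key>")
container = client.get_database_client("mydb").get_container_client("customers")

# Large, mostly static part of the entity: written once, rarely replaced.
profile = {"id": "cust1-profile", "customerId": "cust1", "name": "...", "addresses": ["..."]}

# Small, frequently changing part: cheap to replace because it is tiny.
status = {"id": "cust1-status", "customerId": "cust1", "lastSeen": "2020-09-20T12:00:00Z"}

container.upsert_item(profile)
container.upsert_item(status)

# An update (a full replacement in Cosmos DB) now only touches the small document.
status["lastSeen"] = "2020-09-21T08:30:00Z"
container.replace_item(item=status["id"], body=status)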
We would like to store a set of documents in Cosmos DB with a primary key of EventId. These records are evenly distributed across a number of customers. Clients need to access the latest records for a subset of customers as new documents are added. The documents are immutable, and need to be stored indefinitely.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
If we use just CustomerId as the partition key, we would eventually run over the 10GB limit for a logical partition, and if we use EventId, then querying becomes inefficient (would result in a cross-partition query, and high RU usage, which we'd like to avoid).
Another idea would be to group documents into blocks. i.e. PartitionKey = int(EventId / PartitionSize). This would result in all clients hitting the latest partition(s), which presumably would result in poor performance and throttling.
If we use a combined PartitionKey of CustomerId and int(EventId / PartitionSize), then it's not clear to me how we would avoid a cross-partition query to retrieve the correct set of documents.
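For concreteness, here is a small illustration of the block ideas above; PARTITION_SIZE and the key format are arbitrary choices for the example, not values from the question:

# Illustration only: PARTITION_SIZE and the key format are arbitrary.
PARTITION_SIZE = 100_000

def block_partition_key(event_id: int) -> int:
    """All events within the same block of ids share one logical partition."""
    return event_id // PARTITION_SIZE

def combined_partition_key(customer_id: str, event_id: int) -> str:
    """Combined CustomerId + block variant."""
    return f"{customer_id}-{block_partition_key(event_id)}"

print(block_partition_key(123_456))               # -> 1
print(combined_partition_key("cust1", 123_456))   # -> "cust1-1"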
Edit:
Clarification of a couple of points:
Clients will access the events by specifying a list of CustomerId's, the last EventId they received, and a maximum number of records to retrieve.
For this reason, the use of EventId alone won't perform well, as it will result in a cross partition query (i.e. WHERE EventId > LastEventId).
The system will probably be writing on the order of 1GB a day, in 15 minute increments.
It's hard to know what the read volume will be, but I'd guess probably moderate, with maybe a few thousand clients polling the API at regular intervals.
So, first things first: the logical partition size limit has now been increased to 20 GB, please see here.
You can use EventID as the partition key as well; there is a limit on a logical partition's size in GB, but there is no limit on the number of logical partitions. So using EventID is fine, and you will get a point read, which is very fast, if you query using the EventID. Now, you mention that this way you would have to do cross-partition queries; can you explain how?
A few things to keep in mind, though: Cosmos DB is not really meant for storing this kind of log-based data, as it stores everything on SSDs, so please calculate how big one of your documents is, how many you would have to store per second, then per day, and then per month. You can use TTL to delete documents from Cosmos DB when you are done with them, store them long term in Azure Blob Storage, and use Azure Search for fast retrieval, querying the data in Blob Storage by CustomerID and EventID.
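A minimal sketch of creating such a container with a default TTL using the azure-cosmos Python SDK; the database and container names, partition key path, and TTL value are assumptions:

# Hedged sketch: names, partition key path, and TTL value are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("<account-endpoint>", credential="<account-key>")
db = client.get_database_client("mydb")

# Items expire 30 days after their last write unless a document overrides "ttl" itself.
events = db.create_container_if_not_exists(
    id="events",
    partition_key=PartitionKey(path="/eventId"),
    default_ttl=30 * 24 * 60 * 60,
)

# A point read by id plus partition key is the cheapest way to fetch a single event.
event = events.read_item(item="<event-id>", partition_key="<event-id>")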
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
I faced a similar issue some time back, and a partition key of customerId + date key, e.g. cust1_20200920, worked well for me.
I created the date key as 20200920 (YYYYMMDD), but you can choose to drop the day or even the month (cust1_202009 / cust1_2020), based on your query requirements.
Also, IMO, having multiple known partition keys at query time is kind of a good thing. For example, if you keep YYYYMM in the partition key and want data for 4 months, you can run 4 queries in parallel and combine the data, which is faster if you have many clients and these partition keys are distributed among multiple physical partitions.
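A minimal sketch of that synthetic key, assuming hypothetical field names and a monthly granularity (none of this is prescribed above):

# Illustration only: field names and granularity are arbitrary.
from datetime import datetime, timezone

def synthetic_partition_key(customer_id: str, when: datetime) -> str:
    """Combine the customer id and the month into a single partition key value."""
    return f"{customer_id}_{when:%Y%m}"            # e.g. "cust1_202009"

def partition_keys_for_months(customer_id: str, months: list) -> list:
    """One key per month; each month can be queried independently and in parallel."""
    return [f"{customer_id}_{m}" for m in months]

print(synthetic_partition_key("cust1", datetime(2020, 9, 20, tzinfo=timezone.utc)))
print(partition_keys_for_months("cust1", ["202006", "202007", "202008", "202009"]))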
On a separate note, Cosmos DB has recently introduced an analytical store for transactional data, which can be useful for your use case.
More about it here - https://learn.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction
One approach is to use multiple Cosmos DB containers as "hot/cold" tiers with different partitioning. We could use two containers:
Recent: all writes and all queries for recent items go here. Partitioned by CustomerId.
Archive: all items are copied here for long term storage and access. Partitioned by CustomerId + timespan (e.g. partition per calendar month)
The Recent container would provide single-partition queries by customer. Data growth per partition would be limited either by setting a reasonable TTL during creation or by using a separate maintenance job (perhaps an Azure Function on a timer) to delete items once they are no longer candidates for recent-item queries.
A change feed processor, implemented by an Azure Function or otherwise, would trigger on each creation in Recent and make a copy into Archive (a sketch of that copy step follows below). The copy would have a partition key combining the customer ID and a date range, as appropriate, to limit the partition size.
This scheme should provide efficient recent-item queries from Recent and safe long-term storage in Archive, with reasonable Archive query efficiency for a given date range. The main downside is two writes for each item (one per container), but that is the tradeoff for efficient polling. Whether it is worthwhile is probably best determined by simulating the load and observing performance.
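A minimal sketch of the copy step; the container names, field names, and document shape are hypothetical, and wiring it up to the change feed (for example via an Azure Functions Cosmos DB trigger) is left out:

# Hedged sketch: container names, field names, and document shape are hypothetical.
from azure.cosmos import CosmosClient

client = CosmosClient("<account-endpoint>", credential="<account-key>")
db = client.get_database_client("mydb")
archive = db.get_container_client("Archive")     # partitioned by /archiveKey

def copy_to_archive(changed_docs):
    """Called with a batch of documents observed on the Recent container's change feed."""
    for doc in changed_docs:
        month = doc["timestamp"][:7].replace("-", "")          # "2020-09-..." -> "202009"
        archived = dict(doc)
        archived["archiveKey"] = f"{doc['customerId']}_{month}"
        archive.upsert_item(archived)                          # idempotent copy

# Example batch, shaped the way the change feed might deliver it:
copy_to_archive([
    {"id": "evt-1001", "customerId": "cust1", "timestamp": "2020-09-20T12:00:00Z"},
])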
I'm using ArangoDB for a Web Application through Strongloop.
I've got a performance problem when I run this query:
FOR result IN Collection SORT result.field ASC RETURN result
I added some indexes to speed up the query, such as a skiplist index on the sorted field.
My collection contains more than 1M records.
The application is hosted on n1-highmem-2 on Google Cloud.
Below some specs:
2 CPUs - Xeon E5 2.3 GHz
13 GB of RAM
10GB SSD
Unfortunately, my query takes a long time to finish.
What can I do?
Best regards,
Carmelo
Summarizing the discussion above:
If there is a skiplist index present on the field attribute, it can be used for the sort. However, if it was created sparse, it can't be. This can be verified by running
db.Collection.getIndexes();
in the ArangoShell. If the index is present and non-sparse, then the query should use the index for sorting and no additional sort step will be required, which can be validated using Explain.
However, the query will still build a huge result in memory, which will take time and consume RAM.
If a large result set is desired, LIMIT can be used to retrieve slices of the results in several chunks, which will cause less stress on the machine.
For example, first iteration:
FOR result IN Collection SORT result.field LIMIT 10000 RETURN result
Then process these first 10,000 documents offline, and note the result value of the last processed document.
Now run the query again, but now with an additional FILTER:
FOR result IN Collection
  FILTER result.field > @lastValue
  SORT result.field LIMIT 10000 RETURN result
until there are no more documents. That should work fine if result.field is unique.
If result.field is not unique and there are no other unique keys in the collection covered by a skiplist, then the described method will be at least an approximation.
Note also that when splitting the query into chunks this won't provide snapshot isolation, but depending on the use case it may be good enough already.
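A minimal sketch of that chunked loop using the python-arango driver, assuming placeholder connection details and that result.field is unique (as required above); the process() step is a hypothetical placeholder:

# Hedged sketch: host, credentials, and collection/field names are placeholders.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("mydb", username="root", password="<password>")

CHUNK_QUERY = """
FOR result IN Collection
  FILTER result.field > @lastValue
  SORT result.field
  LIMIT @chunk
  RETURN result
"""

def process(batch):
    """Hypothetical offline processing step for one chunk."""
    print(f"processing {len(batch)} documents")

last_value = ""            # assumed to sort before every real value of result.field
while True:
    cursor = db.aql.execute(CHUNK_QUERY, bind_vars={"lastValue": last_value, "chunk": 10000})
    batch = list(cursor)
    if not batch:
        break              # no more documents
    process(batch)
    last_value = batch[-1]["field"]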