Azure Cosmos DB - incorrect and variable document count

I have inserted exactly 1 million documents in an Azure Cosmos DB SQL container using the Bulk Executor. No errors were logged. All documents share the same partition key. The container is provisioned for 3,200 RU/s, unlimited storage capacity and single-region write.
When performing a simple count query:
select value count(1) from c where c.partitionKey = @partitionKey
I get results varying from 303,000 to 307,000.
This count query works fine for smaller partitions (from 10k up to 250k documents).
What could cause this strange behavior?

This is expected behavior in Cosmos DB. Firstly, what you need to know is that Cosmos DB (formerly DocumentDB) imposes limits on the response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?
Secondly, if you want to query large amounts of data from Cosmos DB, you have to consider query performance; please refer to this article: Tuning query performance with Azure Cosmos DB.
Looking at the DocumentDB REST API, you can observe several important headers which have a significant impact on query operations: x-ms-max-item-count and x-ms-continuation.
So, the varying counts are the result of an RU bottleneck: each execution of the count query is capped by the RUs allocated to your collection, and the response you received will have carried a continuation token.
You have two options:
1. Of course, you could raise the RU setting.
2. To save cost, you could keep fetching the next set of results via the continuation token and add up the partial counts until the token is exhausted, so that you get the total count (typically in the SDK).
You could set the value of Max Item Count and paginate your data using continuation tokens. The DocumentDB SDK supports reading paginated data seamlessly. You could refer to the snippet of Python code below:
q = client.QueryDocuments(collection_link, query, {'maxItemCount': 10})
# _fetch_function is a private method of the legacy pydocumentdb SDK;
# it returns one page of results together with the response headers
results_1 = q._fetch_function({'maxItemCount': 10})
# the continuation token is a string representing a JSON object
token = results_1[1]['x-ms-continuation']
# pass the token back to fetch the next page
results_2 = q._fetch_function({'maxItemCount': 10, 'continuation': token})
I imported exactly 30k documents into my database, then ran the query
select value count(1) from c in Query Explorer. It turns out each page covers only part of the total documents, so I had to add the partial counts together by clicking the Next Page button.
Of course, you could run this query in SDK code via the continuation token, as in the sketch below.
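Here is a minimal sketch with the v3 .NET SDK (Microsoft.Azure.Cosmos); the partition key property name and the container handle are assumptions, not taken from the question:
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Sum the per-page partial counts until the iterator is drained, so a
// continuation token never truncates the total.
static async Task<long> CountDocumentsAsync(Container container, string pkValue)
{
    QueryDefinition query = new QueryDefinition(
            "SELECT VALUE COUNT(1) FROM c WHERE c.partitionKey = @pk")
        .WithParameter("@pk", pkValue);

    long total = 0;
    using FeedIterator<long> iterator = container.GetItemQueryIterator<long>(
        query,
        requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey(pkValue) });

    while (iterator.HasMoreResults)
    {
        FeedResponse<long> page = await iterator.ReadNextAsync();
        foreach (long partialCount in page)
        {
            total += partialCount; // each page may carry only a partial aggregate
        }
    }
    return total;
}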

Related

Why is this zero-result Cosmos DB query so expensive?

I'm investigating why we're exhausting so many RUs in Cosmos. Our writes consume the expected number of RUs, but our reads are through the roof: an order of magnitude more than our writes. I tried to strip it down to the simplest scenario. A single request querying on a partition with no results uses up 2000 RUs. Why is this so expensive?
var query = new QueryDefinition("SELECT * FROM c WHERE c.partitionKey = @partitionKey ORDER BY c._ts ASC, c.id ASC")
    .WithParameter("@partitionKey", id.Value);
using var queryResultSetIterator = container.GetItemQueryIterator<MyType>(query,
    requestOptions: new QueryRequestOptions
    {
        PartitionKey = new PartitionKey(id.Value.ToString()),
    });
while (queryResultSetIterator.HasMoreResults)
{
    foreach (var response in await queryResultSetIterator.ReadNextAsync())
    {
        yield return response.Data;
    }
}
The partition key of the collection is /partitionKey. The RU capacity is set directly on the container, not shared. We have a composite index matching the ORDER BY clause (_ts asc, id asc), although I'm not sure how that would make any difference when no records are returned.
Unfortunately the SDK doesn't appear to surface the spent RUs when streaming results this way, so I've been using Azure Monitor to observe RU usage.
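(In fact, each page's FeedResponse<T> does expose the charge for that round trip via its RequestCharge property in the v3 SDK; a minimal sketch, reusing container, query, MyType, and id from the snippet above:)
double totalCharge = 0;
using FeedIterator<MyType> iterator = container.GetItemQueryIterator<MyType>(query,
    requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey(id.Value.ToString()) });
while (iterator.HasMoreResults)
{
    FeedResponse<MyType> page = await iterator.ReadNextAsync();
    totalCharge += page.RequestCharge; // RUs consumed by this page
}
Console.WriteLine($"Total query charge: {totalCharge} RU");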
Is anyone able to shed any light on why this query, returning zero records and limited to a single partition would take 2k RUs?
Update:
I just ran this query on another instance of the database in the same Cosmos DB account, both configured identically. DB1 has 0 MB in it, DB2 has 44 MB. For the exact same operation returning no records, DB1 used 111 RUs and DB2 used 4,730 RUs: over 40 times more for the same no-result query.
Adding some more detail: The consistency is set to consistent prefix. It's single region.
Another Update:
I've replicated the issue just querying via Azure Portal and it's related to the number of records in the container. Looking at the query stats it's as though it's loading every single document in the container to search on the partition key. Is the partition key not the most performant way to search? Doesn't Cosmos know exactly where to find documents belonging to a partition key by design?
2445.38 RUs
Showing Results: 0 - 0
Retrieved document count: 65671
Retrieved document size: 294343656 bytes
Output document count: 0
Output document size: 147 bytes
Index hit document count: 0
Index lookup time: 0 ms
Document load time: 8804.06 ms
Query engine execution time: 133.11 ms
System function execution time: 0 ms
User defined function execution time: 0 ms
Document write time: 0 ms
I eventually got to the bottom of the issue. In order to filter on the partition key, the partition key path needs to be indexed. Which strikes me as odd, considering the partition key is used to decide where a document is stored, so you'd think Cosmos would inherently know the location of every partition key.
Including the partition key in the indexing policy's included paths solved my problem. It also explains why performance degraded over time as the database grew in size: the query was scanning through every single document.
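For illustration, a hedged sketch of creating a container with the partition key path explicitly included in the indexing policy (v3 .NET SDK; database, the container name, and the paths are placeholders):
using Microsoft.Azure.Cosmos;

// Opt-in indexing: include the partition key path so filters on
// c.partitionKey hit the index instead of loading every document.
var properties = new ContainerProperties(id: "myContainer", partitionKeyPath: "/partitionKey");
properties.IndexingPolicy.IncludedPaths.Clear();
properties.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/partitionKey/?" });
properties.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/_ts/?" });
properties.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/*" });
ContainerResponse response = await database.CreateContainerIfNotExistsAsync(properties);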

Azure CosmosDB Document - Bulk Deletion

Recently, I have been asked to delete a few million records from a total of 14 TB of Cosmos DB data.
When I looked on the internet, I found a stored procedure to do the bulk delete, but it works based on the partition key.
My scenario is, we have 4 attributes in each document:
1. id
2. number [ Partition Key]
3. startdate
4. enddate
The requirement is to delete the documents based on startdate:
Delete * from c where c.startdate >= '' and c.startdate <= ''
A query like the above (Cosmos DB SQL has no DELETE statement, so this is pseudocode) would have to go through all the partitions and delete the matching records.
I also tried running the query in Databricks, pulling all the Cosmos DB records into a temp DataFrame, adding a TTL attribute, and then upserting them back to Cosmos DB.
Is there a better way to achieve the same?
Generally speaking, bulk deletion has the methods listed in this article.
Since your data set is very large, bulkDelete.js may no longer be suitable; after all, a stored procedure has an execution time limit. In addition to the solution described in your question, I also suggest that you encapsulate a method yourself with SDK code:
Set maxItemCount = 100 and EnableCrossPartitionQuery = true in your query request. Meanwhile, you get a continuation token for the next page of data. Process the data in batches; you could borrow some snippets from the .NET bulk delete library (GeneratePartitionKeyDocumentIdTuplesToBulkDelete and BulkDeleteAsync).
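A hedged sketch of that pattern with the v3 .NET SDK (which follows cross-partition continuation tokens for you, so EnableCrossPartitionQuery is not needed); the property names mirror the question, everything else is an assumption:
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

static async Task DeleteByDateRangeAsync(Container container, string from, string to)
{
    QueryDefinition query = new QueryDefinition(
            "SELECT c.id, c.number FROM c WHERE c.startdate >= @from AND c.startdate <= @to")
        .WithParameter("@from", from)
        .WithParameter("@to", to);

    // Cross-partition query, paged 100 documents at a time.
    using FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(
        query, requestOptions: new QueryRequestOptions { MaxItemCount = 100 });

    while (iterator.HasMoreResults)
    {
        foreach (var doc in await iterator.ReadNextAsync())
        {
            // Point-delete each match by id and partition key; this assumes
            // the "number" partition key is stored as a string.
            await container.DeleteItemAsync<dynamic>(
                (string)doc.id, new PartitionKey((string)doc.number));
        }
    }
}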

GET vs Query on Partition Key and Item Key in Cosmos DB

I was reading the Cosmos DB docs on best practices for query performance, and I found the following ambiguous:
With Azure Cosmos DB, typically queries perform in the following order
from fastest/most efficient to slower/less efficient.
1. GET on a single partition key and item key
2. Query with a filter clause on a single partition key
3. Query without an equality or range filter clause on any property
4. Query without filters
Is there a difference in performance or RUs between a "GET on a single partition key and item key" and a "query on a single partition key and item key"? It's not entirely clear to me whether this falls into case #1 or #2, or somewhere in between.
Basically, I'm asking whether we ever need to use GET at all. The docs don't seem to clarify this anywhere.
A direct GET will be faster. As documented, a 1 KB document should cost 1 RU to retrieve. You will have a higher RU cost for a query, as you're engaging the query engine.
One caveat: with a direct read (the GET), you will retrieve the entire document. With a query, you can choose the projection of properties. For very large documents, this could result in significant bandwidth savings for your app, when using a query.
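To make the comparison concrete, here is a minimal v3 .NET SDK sketch that fetches the same document both ways and prints the charge of each; the container handle and the id/partition key values are placeholders:
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

static async Task CompareChargesAsync(Container container, string id, string pk)
{
    // Point read (the "GET"): roughly 1 RU for a 1 KB document.
    ItemResponse<dynamic> read =
        await container.ReadItemAsync<dynamic>(id, new PartitionKey(pk));
    Console.WriteLine($"Point read: {read.RequestCharge} RU");

    // Equivalent query: engages the query engine, so it costs more.
    QueryDefinition query = new QueryDefinition("SELECT * FROM c WHERE c.id = @id")
        .WithParameter("@id", id);
    using FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(
        query, requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey(pk) });
    while (iterator.HasMoreResults)
    {
        FeedResponse<dynamic> page = await iterator.ReadNextAsync();
        Console.WriteLine($"Query page: {page.RequestCharge} RU");
    }
}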

Azure Table Storage TableContinuationToken NextTableName purpose

It is sometimes said that when using Azure Tables there is effectively a 3rd key partitioning data - the Table Name itself.
I noticed when executing a segmented query that the TableContinuationToken has a NextTableName property. What is the purpose of this property? Could it be useful if a query could span multiple tables?
It's for segmented Query Tables operations, when the full result set can't be returned in a single response.
Quoting from https://learn.microsoft.com/en-us/rest/api/storageservices/query-tables:
A query against the Table service may return a maximum of 1,000 tables at one time and may execute for a maximum of five seconds. If the result set contains more than 1,000 tables, if the query did not complete within five seconds, or if the query crosses the partition boundary, the response includes a custom header containing the x-ms-continuation-NextTableName continuation token. The continuation token may be used to construct a subsequent request for the next page of data. For more information about continuation tokens, see Query Timeout and Pagination.
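For illustration, a hedged sketch of a segmented table listing with the .NET client (tableClient is assumed to be a CloudTableClient; the legacy WindowsAzure.Storage client behaves the same way):
TableContinuationToken token = null;
do
{
    // The service fills in NextTableName on the token when more tables remain.
    TableResultSegment segment = await tableClient.ListTablesSegmentedAsync(token);
    foreach (CloudTable table in segment.Results)
        Console.WriteLine(table.Name);
    token = segment.ContinuationToken; // null once the listing is complete
} while (token != null);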

Azure Table Storage API returns 0 results with Continuation Token

We are using the .NET Azure Storage client library to retrieve data from the server. But when we try to retrieve data, the result has 0 items and a continuation token. When we fetch the next page with this continuation token, we get the same result again. However, when we use the 4th continuation token fetched like this, we get the proper result with 15 items (the requested item count for all requests is 15). This issue is observed only when we apply filter conditions. The code used to fetch the results is given below:
var tableReference = _tableClient.GetTableReference(tableName);
var query = new TableQuery();
// DeviceId is of type Int32, so use a typed filter rather than quoting the value
query.Where(TableQuery.GenerateFilterConditionForInt("DeviceId", QueryComparisons.Equal, 99));
query.TakeCount = 15;
TableContinuationToken token = null; // null requests the first segment
var resultsQuery = tableReference.ExecuteQuerySegmented(query, token);
var nextToken = resultsQuery.ContinuationToken;
var results = resultsQuery.ToList();
This is expected behavior. From Query Timeout and Pagination:
A query against the Table service may return a maximum of 1,000 items
at one time and may execute for a maximum of five seconds. If the
result set contains more than 1,000 items, if the query did not
complete within five seconds, or if the query crosses the partition
boundary, the response includes headers which provide the developer
with continuation tokens to use in order to resume the query at the
next item in the result set. Continuation token headers may be
returned for a Query Tables operation or a Query Entities operation.
I noticed that you're not using PartitionKey in your query. This results in a full table scan. The recommendation is to always use PartitionKey (and possibly RowKey) in your queries to avoid full table scans. I would highly recommend reading the Azure Storage Table Design Guide: Designing Scalable and Performant Tables to get the most out of Azure Tables.
UPDATE: Explaining "If the query crosses the partition boundary"
Let me try with an example as to what I understand by partition boundary. Let's assume you have 1 million rows in your table, evenly spread across 10 partitions (let's assume your PartitionKeys are 001, 002, 003, ... 010). Now, data in Azure Tables is organized by PartitionKey, and within a partition by RowKey. Since your query did not specify a PartitionKey, the Table service starts with the 1st partition (i.e. PartitionKey == 001) and tries to find matching data there. If it does not find any data in that partition, it does not know whether matching data exists in another partition, so instead of scanning the next partition it simply returns with a continuation token and leaves it to the client consuming the API to decide whether to continue the search using the same parameters plus the continuation token, or to revise the search and start over. A sketch of draining such a segmented query follows.
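A minimal sketch of that client-side loop (table and query are assumed to be set up as in the question):
var entities = new List<DynamicTableEntity>();
TableContinuationToken token = null;
do
{
    // A segment may legitimately come back empty when the scan crosses a
    // partition boundary without finding matches.
    TableQuerySegment<DynamicTableEntity> segment =
        await table.ExecuteQuerySegmentedAsync(query, token);
    entities.AddRange(segment.Results);
    token = segment.ContinuationToken; // non-null means "keep going"
} while (token != null);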
