CosmosDB bulk delete without partition key - Azure

I am trying to do a queried bulk delete of documents using a stored procedure on a Cosmos DB collection. I have used the sample code from here:
https://github.com/Azure/azure-documentdb-js-server/blob/master/samples/stored-procedures/bulkDelete.js
When I try to execute the query, I am forced to provide a partition key, which I do not know. I want to execute a fan-out delete query based on criteria that do not include the partition key. What other ways can I try to delete documents in bulk from a Cosmos DB collection?

If the collection the stored procedure is registered against is a
single-partition collection, then the transaction is scoped to all the
documents within the collection. If the collection is partitioned,
then stored procedures are executed in the transaction scope of a
single partition key. Each stored procedure execution must then
include a partition key value corresponding to the scope the
transaction must run under.
The description quoted above is from the documentation here.
So if your collection is partitioned, you do need to supply a partition key when operating on the collection or the documents in it. More details here.
Since, in your situation, you do not know the partition key, I suggest you set EnableCrossPartitionQuery to true in the FeedOptions when executing the deletion (this has a performance bottleneck).
Hope it helps you.
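For illustration, here is a minimal sketch of that fan-out delete using the Python SDK (azure-cosmos v4) rather than the .NET FeedOptions; the account, container, filter, and the /pk property are all placeholders, not values from the question:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<db>").get_container_client("<coll>")

# No partition key supplied, so the query fans out to every physical
# partition -- this is the performance bottleneck mentioned above.
docs = container.query_items(
    query="SELECT c.id, c.pk FROM c WHERE c.status = 'expired'",  # hypothetical filter
    enable_cross_partition_query=True,
)

# Each individual delete still needs the document's own partition key
# value, so the query must return it (assumed here to live at /pk).
for doc in docs:
    container.delete_item(item=doc["id"], partition_key=doc["pk"])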

Related

Azure CosmosDB Document - Bulk Deletion

Recently, I was asked to delete a few million records from a total of 14 TB of Cosmos DB data.
When I looked on the internet, I found a stored procedure to do the bulk delete, but it works based on a partition key.
My scenario is, we have 4 attributes in each document:
1. id
2. number [ Partition Key]
3. startdate
4. enddate
The requirement is to delete the documents based on startdate.
Delete * from c where c.startdate >= '' and c.startdate <= ''
The above query would have to go through all the partitions to delete the records.
I also tried running a query in Databricks to pull all the Cosmos DB records into a temp DataFrame, add a TTL attribute, and then upsert back into Cosmos DB.
Is there a better way to achieve the same?
Generally speaking, bulk deletion can be done with the methods listed in this article.
Since your data volume is very large, bulkDelete.js may no longer be suitable; after all, a stored procedure has an execution time limit. In addition to the solution described in your question, I also suggest that you encapsulate a method yourself with SDK code:
Set maxItemCount = 100 and EnableCrossPartitionQuery = true in your query request. Meanwhile, you get a continuation token for the next page of data. Process the data batch by batch; you could reuse some snippets from the .NET bulk delete library (GeneratePartitionKeyDocumentIdTuplesToBulkDelete and BulkDeleteAsync).
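A rough sketch of that paged loop, using the Python SDK (azure-cosmos v4) in place of the .NET library; the date range is invented for illustration, and only /number (the partition key) comes from the question:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<db>").get_container_client("<coll>")

query = ("SELECT c.id, c.number FROM c "
         "WHERE c.startdate >= '2019-01-01' AND c.startdate <= '2019-06-30'")

# by_page() yields one batch at a time; pager.continuation_token can be
# persisted so an interrupted run resumes where it stopped.
pager = container.query_items(
    query=query,
    enable_cross_partition_query=True,
    max_item_count=100,  # page size, mirroring maxItemCount = 100
).by_page()

for page in pager:
    # The (partition key, id) tuples in this batch drive the deletes.
    for doc in page:
        container.delete_item(item=doc["id"], partition_key=doc["number"])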

Adding a partition key to an existing collection - Azure Cosmos DB

Is there any way we can add a partition key to a collection we already have in Azure Cosmos DB, or do we need to drop it and create a new collection with a partition key and import the data from the previous collection?
I tried googling a lot and checking the settings of the collection, but nothing helped. If you could help, that would be great. Thanks in advance.
Once created, a collection's partition key definition cannot change. This means that you cannot add, remove, or alter the partition key of a collection after creation.
You can use the Cosmos DB change feed to migrate to a new collection with the appropriate partition key.
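For illustration, a hedged sketch of such a change-feed migration with the Python SDK (azure-cosmos v4); the container names and the new /newKey partition key path are assumptions:

from azure.cosmos import CosmosClient, PartitionKey

db = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>").get_database_client("<db>")

old = db.get_container_client("<old-coll>")
new = db.create_container_if_not_exists(
    id="<new-coll>",
    partition_key=PartitionKey(path="/newKey"),  # the key the old collection lacked
)

# Replay the old container's change feed from the beginning and upsert
# every document into the repartitioned container, dropping the
# system-generated properties (_rid, _etag, ...) along the way.
for doc in old.query_items_change_feed(is_start_from_beginning=True):
    new.upsert_item({k: v for k, v in doc.items() if not k.startswith("_")})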

GET vs Query on Partition Key and Item Key in Cosmos DB

I was reading the Cosmos DB docs on best practices for query performance, and I found the following ambiguous:
With Azure Cosmos DB, typically queries perform in the following order
from fastest/most efficient to slower/less efficient.
GET on a single partition key and item key
Query with a filter clause on a single partition key
Query without an equality or range filter clause on any property
Query without filters
Is there a difference in performance or RUs between a "GET on a single partition key and item key" and a "query on a single partition key and item key"? It's not entirely clear to me whether this falls into case #1 or case #2, or somewhere in between.
Basically, I'm asking whether we ever need to use GET at all. The docs don't seem to clarify this anywhere.
A direct GET will be faster. As documented, retrieving a 1 KB document should cost 1 RU. A query will have a higher RU cost, as you're engaging the query engine.
One caveat: with a direct read (the GET), you will retrieve the entire document. With a query, you can choose the projection of properties. For very large documents, this could result in significant bandwidth savings for your app, when using a query.
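To make the two cases concrete, a small sketch with the Python SDK (azure-cosmos v4), assuming a hypothetical document with id "abc" under partition key "pk1":

from azure.cosmos import CosmosClient

container = (CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
             .get_database_client("<db>")
             .get_container_client("<coll>"))

# Case 1: the point read (GET) -- cheapest, returns the whole document.
doc = container.read_item(item="abc", partition_key="pk1")

# Case 2: a query on the same keys -- engages the query engine (more RUs)
# but lets you project just the properties you need.
slim = list(container.query_items(
    query="SELECT c.name FROM c WHERE c.id = 'abc'",
    partition_key="pk1",  # scopes the query to a single partition
))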

Role of partition key in Cosmos DB Sql API Insert? With the Bulk Executor?

I'm trying to repeatedly insert batches of about 850 documents of 100-300 KB each into a Cosmos collection. I have them all under the same partition key.
The estimator suggests that 50K RUs should handle this in short order, but even at well over 100K RUs it's averaging 20 minutes or so per set rather than something more reasonable.
Should I have unique partition keys for each document? Is the problem that, with all the documents going to the same partition key, they are being handled in series and the capacity isn't load leveling?
Will using the bulk executor fix this?
Should I have unique partition keys for each document? Is the problem
that, with all the documents going to the same partition key, they are
being handled in series and the capacity isn't load leveling?
You can find the below statement in this doc.
To fully utilize throughput provisioned for a container or a set of
containers, you must choose a partition key that allows you to evenly
distribute requests across all distinct partition key values.
So defining a good partition key matters for both inserts and queries. However, choosing a partition key is really worth digging into; please refer to this doc to choose your partition key.
Will using the bulk executor fix this?
Yes, you could use a continuation token with bulk insert. For more details, please refer to my previous answer: How do I get a continuation token for a bulk INSERT on Azure Cosmos DB?
Hope it helps you.
Just to summarize: we needed to evaluate the collection's default indexing. Indexing can take 100 to 1000x more RUs than actually writing the document.
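As an illustration of trimming that indexing cost, a sketch with the Python SDK (azure-cosmos v4) that creates a container indexing only one queried path; the container name, partition key path, and field name are assumptions, not the asker's schema:

from azure.cosmos import CosmosClient, PartitionKey

db = (CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
      .get_database_client("<db>"))

db.create_container_if_not_exists(
    id="<coll>",
    partition_key=PartitionKey(path="/pk"),
    indexing_policy={
        "indexingMode": "consistent",
        "includedPaths": [{"path": "/queriedField/?"}],  # index only what is filtered on
        "excludedPaths": [{"path": "/*"}],               # skip everything else on write
    },
)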

Azure Cosmos DB asking for partition key for stored procedure

I am using a GUID id as my partition key and I am facing a problem when trying to run a stored procedure. To run a stored procedure I need to provide a partition key, and I am not sure what value I should provide in this case. Please assist.
If the collection the stored procedure is registered against is a
single-partition collection, then the transaction is scoped to all the
documents within the collection. If the collection is partitioned,
then stored procedures are executed in the transaction scope of a
single partition key. Each stored procedure execution must then
include a partition key value corresponding to the scope the
transaction must run under.
The description quoted above is from the documentation here.
As @Rafat Sarosh said, a GUID id is not an appropriate partition key. Based on your situation, city may be more appropriate. You may need to adjust your database partitioning scheme, because a partition key cannot be deleted or modified after you have defined it.
I suggest exporting your data to a JSON file and then importing it into a new collection partitioned by city via the Azure Cosmos DB Data Migration Tool.
Hope it helps you.
Just for summary:
Issue:
Unable to provide a specific partition key value when executing SQL to query documents.
Solution:
1. Set EnableCrossPartitionQuery to true when executing the query SQL (has a performance bottleneck).
2. Consider setting a frequently queried field as the partition key.
For example, if your partition key is /id
and your Cosmos document is
{
  "id": "abcde"
}
then when the stored procedure runs, you need to pass the value abcde.
So if you want your stored procedure to run across all partitions, it can't.
Answer from the Cosmos team:
https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/33550159-support-stored-procedure-execution-over-all-partit
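To make the scoping concrete, a sketch of executing such a stored procedure with the Python SDK (azure-cosmos v4); the sproc name and the abcde key value follow the example above, and only documents under that one key are visible to the procedure:

from azure.cosmos import CosmosClient

container = (CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
             .get_database_client("<db>")
             .get_container_client("<coll>"))

result = container.scripts.execute_stored_procedure(
    sproc="bulkDelete",          # the registered server-side script
    params=["SELECT * FROM c"],  # query the sproc runs internally
    partition_key="abcde",       # required: the transaction is scoped to this one key
)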
