Azure CosmosDB Document - Bulk Deletion - azure

Recently, I have asked to delete few million records from a total of 14Tb of Cosmos Db data.
When I looked into the internet, I found a stored proc to do the bulk delete and that works based on partition key.
My scenario is, we have the 4 attributes in each document.
1. id
2. number [ Partition Key]
3. startdate
4. enddate
The requirement is to delete the documents based on startdate.
Delete * from c where c.startdate >= '' and c.startdate <=''
The above query goes through all the partition and deletes the records.
I also checked by running the query in Databricks to take the whole CosmosDB records in a temp Dataframe and add TTL attibute and then upsert to Cosmos DB again.
Is there a better way to achieve the same?

Generally speaking, bulk deletion has the methods listed in this article.
Since your data is very huge,maybe bulkDelete.js is not suitable any more. After all, SP has execution time limit.In addition to the solution described in your question, I also suggest that you could use SDK code to encapsulate a method by yourself:
Set the maxItemCount = 100 and EnableCrossPartitionQuery = true in your query request.Meanwhile, you could get continuation token which is for next page data. Process the data in the batch and maybe you could get some snippet of code from .net bulk Delete Library (GeneratePartitionKeyDocumentIdTuplesToBulkDelete and BulkDeleteAsyn)

Related

Managing multiple database connections and data with foreachPartition

Will try to make it as clear as possible so an example isn't required as this has to be a concept that I didn't grasp properly and I'm struggling with rather than a problem with data or Spark code itself.
I'm required to insert city data within their own database (MongoDB) and I'm trying to perform those upserts as fast as possible.
Take into account a sample DataFrame with the following, where I want to do some upserts against MongoDB based on, for example, year, city and zone.
year - city - zone - num_business - num_vehicles.
Having groupedBy those columns I'm just pending to perform the upsert into the DB.
Using the MongoDB Driver I'm required to instantiate several WriteConfigs to cope with multiple databases (1 database per city).
// the 'getDatabaseWriteConfigsPerCity' method filters the 'df' so it only contains the docs from a single city.
for (cityDBConnection <- getDatabaseWriteConfigsPerCity(df) {
cityDBConnection.getDf.foreach(
... // set MongoDB upsert criteria.
)
}
Doing it that way works but still, more performance can be gained when using foreachPartition as I hope that those records within the DF are spread to the executors are more data is concurrently being upsert.
However, I get erroneous results when using foreachPartition. Erroneus because they seem incomplete. Counters are way off and such.
I suspect this is because, among the partitions, same keys are in different partitions and it's not until those are merged in the master when those are inserted to MongoDB as a single record.
Is there any way I can make sure partitions contain the total of documents related to an upsert key?
Don't really know if I'm being clear enough, but if it's still too complicated I will update as soon as possible.
Is there any way I can make sure partitions contain the total of
documents related to an upsert key? if you do:
df.repartition("city").foreachPartition{...}
You can be sure that all records with same city are in the same partition (but there is probably more than 1 city per partition!)

Azure Cosmos DB - incorrect and variable document count

I have inserted exactly 1 million documents in an Azure Cosmos DB SQL container using the Bulk Executor. No errors were logged. All documents share the same partition key. The container is provisioned for 3,200 RU/s, unlimited storage capacity and single-region write.
When performing a simple count query:
select value count(1) from c where c.partitionKey = #partitionKey
I get varying results varying from 303,000 to 307,000.
This count query works fine for smaller partitions (from 10k up to 250k documents).
What could cause this strange behavior?
It's reasonable in cosmos db. Firstly, what you need to know is that Document DB imposes limits on Response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?
Secondly, if you want to query large data from Document DB, you have to consider the query performance issue, please refer to this article:Tuning query performance with Azure Cosmos DB.
By looking at the Document DB REST API, you can observe several important parameters which has a significant impact on query operations : x-ms-max-item-count, x-ms-continuation.
So, your error is resulted of bottleneck of RUs setting. The count query is limited by the number for RUs allocated to your collection. The result that you would have received will have a continuation token.
You may have 2 solutions:
1.Surely, you could raise the RUs setting.
2.For cost, you could keep looking for next set of results via continuation token and keep on adding it so that you will get total count.(Probably in sdk)
You could set value of Max Item Count and paginate your data using continuation tokens. The Document Db sdk supports reading paginated data seamlessly. You could refer to the snippet of python code as below:
q = client.QueryDocuments(collection_link, query, {'maxItemCount':10})
results_1 = q._fetch_function({'maxItemCount':10})
#this is a string representing a JSON object
token = results_1[1]['x-ms-continuation']
results_2 = q._fetch_function({'maxItemCount':10,'continuation':token})
I imported exactly 30k documents into my database.Then I tried to run the query
select value count(1) from c in Query Explorer. It turns out only partial of total documents every page. So I need to add them all by clicking Next Page button.
Surely, you could do this query in the sdk code via continuation token.

Get all the Partition Keys in Azure Cosmos DB collection

I have recently started using Azure Cosmos DB in our project. For the reporting purpose, we need to get all the Partition Keys in the collection. I could not find any suitable API to achieve it.
UPDATE: According to Brian in the comments below, DISTINCT is now supported. Try something like:
SELECT DISTINCT c.partitionKey FROM c
Prior answer: Idea that could work but for one thing...
The only way to get the actual partition key values is to do a unique aggregate on that field.
You can directly hit the REST endpoint at https://{your endpoint domain}.documents.azure.com/dbs/{your collection's uri fragment}/pkranges to pull back the minInclusive and maxExclusive ranges for each partition but those are hash space ranges and I don't know how to convert those into partition key values nor do a fanout using the actual minInclusive hash.
Also, there is a slim possibility that the pkranges can change between the time you retrieve them and the time you go to do something with them.

Azure Table Storage API returns 0 results with Continuation Token

We are using .net Azure storage client library to retrieve data from server. But when we try to retrieve data, the result have only 0 items with a continuation token. When we fetch the next page with this continuation token we again gets the same result. However when we use the 4th continuation token fetched like this, we are getting the proper result with 15 items.( The items count for all requests are 15). This issue is observed only when we tried applying filter conditions. The code used to fetch result is given below
var tableReference = _tableClient.GetTableReference(tableName);
var query = new TableQuery();
query.Where("'DeviceId' eq '99'"); // DeviceId is of type Int32
query.TakeCount = 15;
var resultsQuery = tableReference.ExecuteQuerySegmented(query, token);
var nextToken = resultsQuery.ContinuationToken;
var results = resultsQuery.ToList();
This is expected behavior. From Query Timeout and Pagination:
A query against the Table service may return a maximum of 1,000 items
at one time and may execute for a maximum of five seconds. If the
result set contains more than 1,000 items, if the query did not
complete within five seconds, or if the query crosses the partition
boundary, the response includes headers which provide the developer
with continuation tokens to use in order to resume the query at the
next item in the result set. Continuation token headers may be
returned for a Query Tables operation or a Query Entities operation.
I noticed that you're not using PartitionKey in your query. This will result in full table scan. Recommendation would be to always use PartitionKey (and possibly RowKey) in your queries to avoid full table scans. I would highly recommend reading Azure Storage Table Design Guide: Designing Scalable and Performant Tables to get the most out of Azure Tables.
UPDATE: Explaining "If the query crosses the partition boundary"
Let me try with an example as to what I understand by Partition Bounday. Let's assume you have 1 million rows in your table evenly spread across 10 Partitions (let's assume your PartitionKeys are 001, 002, 003,...010). Now we know that the data in Azure Tables is organized by PartitionKey and then in a Partition by RowKey. Since in your query you did not specify PartitionKey, Table Service starts from 1st Partition (i.e. PartitionKey == 001) and tries to find the matching data there. If it does not find any data in that Partition, it does not know whether the data is there in another Partition so instead of going to the next Partition, it simply returns back with a continuation token and leave it to the client consuming the API to decide whether they want to continue the search using the same parameters + continuation token or revise their search to start again.

Cassandra batch select and batch update

I have a requirement to update all users with a specific value in a job.
i have million of users in my Cassandra database. is it okay to query million user first and do some kind of batch update? or is there some implementation available to do these kind of work. I am using hector API to interact with Cassandra. What can be the best possible way to do this.?
You never want to fetch 1 million users and keep them locally. Ideally you want to iterate over all those user keys using a range query. Hector calls this RangeSliceQuery. There is a good example here:
http://irfannagoo.wordpress.com/2013/02/27/hector-slice-query-options-with-cassandra/
For start and end key use null and add this also:
rangeQuery.setRowCount(100) to fetch 100 rows at a time.
Do this inside a loop. The first time you fetch with null being start and end key, the last key you get from the first result set should be the start key of your next query. And you continue paginating like this.
You can then use batch mutate and update in batches.
http://hector-client.github.io/hector/source/content/API/core/1.0-1/me/prettyprint/cassandra/service/BatchMutation.html

Resources