Fetching documents from DocumentDb using continuation token - azure

I have a DocumentDb database with a collection containing 37 documents of around 150 KB each. I use the following code snippet to fetch the documents using paging:
var options = new FeedOptions
{
MaxItemCount = 100
};
var query = client.CreateDocumentQuery<Document>(collection, options).AsDocumentQuery();
while (query.HasMoreResults)
{
var result = query.ExecuteNextAsync<Document>().Result;
Console.WriteLine("Quota Usage: {0}", result.CurrentResourceQuotaUsage);
Console.WriteLine("Continuation Token: {0}", result.ResponseContinuation ?? "null");
var list = result.ToList();
Console.WriteLine("Document Count: {0}", list.Count);
}
The results I get are strange though
Quota Usage: documentSize=4;documentsSize=4386;documentsCount=37;collectionSize= 4395;
Continuation Token: null
Document Count: 36
I never get the last document: the continuation token becomes null after fetching 36 documents, even though the document count shows 37. I am struggling to understand this behaviour. Is there a way to make this return all the documents?

Related

Query records from aws kendra index using sdk

I am testing out AWS Kendra for a business use-case and I am having trouble figuring out how to query data in the index to ensure data accuracy.
The connection where the data is coming from uses our Salesforce instance which contains over 1,000 knowledge articles.
The syncing of data appears to be working and we can see that the document count is 384.
Now, because we have over 1,000 possible articles, we have restricted our API user that is connecting Kendra to Salesforce to only be able to access specific articles.
Before we move forward, we want to ensure that the articles indexed are what we expect and have allowed the API user to bring over.
What I am now trying to do is audit / export the records that are in the index so I can compare them to the records we expect to see from the source.
For this, I am using the JavaScript SDK @aws-sdk/client-kendra.
I wrote a very basic test to try and query all of the records that had the same thing in common; _language_code.
Code Example:
const {
KendraClient,
QueryCommand
} = require("@aws-sdk/client-kendra");
const {
fromIni
} = require("@aws-sdk/credential-provider-ini");
const client = new KendraClient({
credentials: fromIni({
profile: 'ccs-account'
})
});
const fs = require('fs');
const index = "e65cacb1-5492-4760-84aa-7c6faa407455";
const pageSize = 100;
let currentPage = 1;
let totalResults;
let results = [];
/**
* Init
*/
const go = async () => {
let params = getParams(currentPage); // 1 works fine, 100 results returned. 2 returns 0 results
const command = new QueryCommand(params);
const response = await client.send(command);
totalResults = response.TotalNumberOfResults;
results = response.ResultItems;
// Write results to json
fs.writeFile('data.json', JSON.stringify(results, null, 4), (err) => {
if (err) throw err;
});
}
/**
* Get params for query
* @param {*} page
* @returns
*/
function getParams(page) {
return {
IndexId: index,
PageSize: pageSize,
PageNumber: page,
AttributeFilter: {
"EqualsTo": {
"Key": "_language_code",
"Value": {
"StringValue": "en"
}
}
},
SortingConfiguration: {
"DocumentAttributeKey": "_document_title",
"SortOrder": "ASC"
}
};
}
// Run
go();
The Problem / Question:
From what I can see in the documentation, the params seem to accept a PageNumber and PageSize which is an indication of paginated results.
When I query PageNumber=1 and PageSize=100, I get 100 records successfully as expected. Since the pagesize limit seems to be 100 results, my assumption would now be that I can change the PageNumber=2 and get the next 100 results. Repeating this process until I have retrieved the total records so I can QA the data.
I am at a loss as to why 0 records are returned when I target the second page as there should certainly be 3 pages of 100 results and 1 page of 84 results.
Any thoughts on what I am missing here? Is there a simpler way to export the indexed data to perform such analysis?
Thanks!
Please refer to the API documentation: https://docs.aws.amazon.com/kendra/latest/dg/API_Query.html
Each query returns the 100 most relevant results.
So you can't get beyond the top 100 results by requesting a second page. If you need more results, please request a limit increase: https://docs.aws.amazon.com/kendra/latest/dg/quotas.html
Maximum number of search results per query. Default is 100. To enable more than 100 results, see Quotas Support
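The cap described above means a paging loop stops once the per-query limit is reached, however many documents the index holds. Below is a minimal sketch of such a loop, assuming a hypothetical `fetchPage(pageNumber, pageSize)` helper that would wrap `client.send(new QueryCommand(...))` and resolve to the `ResultItems` array. The in-memory `makeFakeFetchPage` only imitates the 100-result cap so the loop can be run without AWS access.

```javascript
// Collects result pages until a page comes back empty.
// fetchPage is a hypothetical wrapper around the Kendra Query call.
async function collectPages(fetchPage, pageSize) {
  const all = [];
  for (let page = 1; ; page++) {
    const items = await fetchPage(page, pageSize);
    if (!items || items.length === 0) break; // nothing reachable beyond this point
    all.push(...items);
  }
  return all;
}

// Stand-in for the service (assumed behaviour): many documents match,
// but only the top `cap` results are reachable through paging.
function makeFakeFetchPage(totalMatches, cap) {
  const reachable = Array.from(
    { length: Math.min(totalMatches, cap) },
    (_, i) => ({ Id: `doc-${i + 1}` })
  );
  return async (page, pageSize) =>
    reachable.slice((page - 1) * pageSize, page * pageSize);
}

async function demo() {
  const items = await collectPages(makeFakeFetchPage(384, 100), 100);
  console.log(`Retrieved ${items.length} of 384 indexed documents`);
  return items.length;
}

demo();
```

With the default quota, this loop ends after the first page of 100, which matches the behaviour reported in the question.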

Cosmos DB Pagination giving multiplied page records

I have a scenario where I need to filter a collection based on elements present in an array inside the documents. Can anyone suggest how to use OFFSET and LIMIT with a nested array in a document?
{
"id": "abcd",
"pqrs": 1,
"xyz": "UNKNOWN_594",
"arrayList": [
{
"Id": 2,
"def": true
},
{
"Id": 302,
"def": true
}
]
}
Now I need to filter the collection and take the records 10 at a time. I tried the following query:
SELECT * FROM collections c
WHERE ARRAY_CONTAINS(c.arrayList , {"Id":302 },true) or ARRAY_CONTAINS(c.arrayList , {"Id":2 },true)
ORDER BY c._ts DESC
OFFSET 10 LIMIT 10
When I run this query, it returns 40 records.
With OFFSET, the RU charge keeps increasing with every subsequent page; you can use a continuation token instead:
private static async Task QueryWithPagingAsync(Uri collectionUri)
{
// The .NET client automatically iterates through all the pages of query results
// Developers can explicitly control paging by creating an IDocumentQueryable
// using the IQueryable object, then by reading the ResponseContinuationToken values
// and passing them back as RequestContinuationToken in FeedOptions.
List<Family> families = new List<Family>();
// tell server we only want 1 record
FeedOptions options = new FeedOptions { MaxItemCount = 1, EnableCrossPartitionQuery = true };
// using AsDocumentQuery you get access to whether or not the query HasMoreResults
// If it does, just call ExecuteNextAsync until there are no more results
// No need to supply a continuation token here as the server keeps track of progress
var query = client.CreateDocumentQuery<Family>(collectionUri, options).AsDocumentQuery();
while (query.HasMoreResults)
{
foreach (Family family in await query.ExecuteNextAsync())
{
families.Add(family);
}
}
// The above sample works fine whilst in a loop as above, but
// what if you load a page of 1 record and then in a different
// Session at a later stage want to continue from where you were?
// well, now you need to capture the continuation token
// and use it on subsequent queries
query = client.CreateDocumentQuery<Family>(
collectionUri,
new FeedOptions { MaxItemCount = 1, EnableCrossPartitionQuery = true }).AsDocumentQuery();
var feedResponse = await query.ExecuteNextAsync<Family>();
string continuation = feedResponse.ResponseContinuation;
foreach (var f in feedResponse.AsEnumerable().OrderBy(f => f.Id))
{
}
// Now the second time around use the continuation token you got
// and start the process from that point
query = client.CreateDocumentQuery<Family>(
collectionUri,
new FeedOptions
{
MaxItemCount = 1,
RequestContinuation = continuation,
EnableCrossPartitionQuery = true
}).AsDocumentQuery();
feedResponse = await query.ExecuteNextAsync<Family>();
foreach (var f in feedResponse.AsEnumerable().OrderBy(f => f.Id))
{
}
}
To page through to a specific page, please find the code below:
private static async Task QueryPageByPage(int currentPageNumber = 1, int documentNumber = 1)
{
// Number of documents per page
const int PAGE_SIZE = 3; // configurable
// Continuation token for subsequent queries (NULL for the very first request/page)
string continuationToken = null;
do
{
Console.WriteLine($"----- PAGE {currentPageNumber} -----");
// Loads ALL documents for the current page
KeyValuePair<string, IEnumerable<Family>> currentPage = await QueryDocumentsByPage(currentPageNumber, PAGE_SIZE, continuationToken);
foreach (Family family in currentPage.Value)
{
documentNumber++;
}
// Ensure the continuation token is kept for the next page query execution
continuationToken = currentPage.Key;
currentPageNumber++;
} while (continuationToken != null);
Console.WriteLine("\n--- END: Finished Querying ALL Documents ---");
}
and QueryDocumentsByPage function as follows
private static async Task<KeyValuePair<string, IEnumerable<Family>>> QueryDocumentsByPage(int pageNumber, int pageSize, string continuationToken)
{
DocumentClient documentClient = new DocumentClient(new Uri("https://{CosmosDB/SQL Account Name}.documents.azure.com:443/"), "{CosmosDB/SQL Account Key}");
var feedOptions = new FeedOptions {
MaxItemCount = pageSize,
EnableCrossPartitionQuery = true,
// IMPORTANT: Set the continuation token (NULL for the first ever request/page)
RequestContinuation = continuationToken
};
IQueryable<Family> filter = documentClient.CreateDocumentQuery<Family>("dbs/{Database Name}/colls/{Collection Name}", feedOptions);
IDocumentQuery<Family> query = filter.AsDocumentQuery();
FeedResponse<Family> feedResponse = await query.ExecuteNextAsync<Family>();
List<Family> documents = new List<Family>();
// Copy the current page of results into a list
foreach (Family t in feedResponse)
{
documents.Add(t);
}
// IMPORTANT: Ensure the continuation token is kept for the next requests
return new KeyValuePair<string, IEnumerable<Family>>(feedResponse.ResponseContinuation, documents);
}
Are you actually receiving 40 elements in the results? Or is it that you are getting back 10 documents but maybe your Cosmos itself has 40 documents for this query?
Using ORDER by clause retrieves all the documents based on the query, orders it in the DB and then applies the OFFSET and LIMIT values to deliver the final results.
I've illustrated this from the below snapshot.
My Cosmos account has 14 documents which match the query, this is
what matches the retrieved document count.
The output document is 10 because the DB had to skip the first 5 and
then deliver the next 5.
But my actual results are only 5 documents because that is what I
asked for.
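The three counts in that explanation can be reproduced with a small in-memory model. This is only a sketch of the arithmetic described above, not of how Cosmos DB executes the query; `offsetLimit` and its field names are made up for illustration.

```javascript
// Models ORDER BY ... OFFSET n LIMIT m over an array of matching documents.
function offsetLimit(matchingDocs, offset, limit) {
  // ORDER BY c._ts DESC
  const ordered = [...matchingDocs].sort((a, b) => b._ts - a._ts);
  // Documents that have to be walked through: the skipped ones plus the delivered ones.
  const scanned = ordered.slice(0, offset + limit);
  return {
    retrievedCount: matchingDocs.length, // everything that matched the filter
    outputCount: scanned.length,         // skipped + delivered
    results: scanned.slice(offset),      // what the caller actually receives
  };
}

const docs = Array.from({ length: 14 }, (_, i) => ({ id: String(i), _ts: 1000 + i }));
const page = offsetLimit(docs, 5, 5);
console.log(page.retrievedCount, page.outputCount, page.results.length); // 14 10 5
```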
Continuation tokens are efficient for paging but have limitations. They cannot be used to skip pages directly (say, to jump from page 1 to page 10). You need to traverse the pages from the first document and keep using the token to move to the next page. Because of these limitations, they are usually recommended when a single query returns a large number of documents.
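To make the limitation concrete, here is a sketch of token-based paging over an in-memory array. The token format (a stringified index) and the `readPage` and `readPageNumber` helpers are assumptions for illustration; real continuation tokens are opaque and should not be parsed.

```javascript
// Returns one page plus the token needed to request the next one.
function readPage(items, pageSize, token) {
  const start = token == null ? 0 : Number(token);
  const end = Math.min(start + pageSize, items.length);
  return {
    items: items.slice(start, end),
    continuation: end < items.length ? String(end) : null, // null => no more pages
  };
}

// Reaching page N means walking pages 1..N-1 first to obtain each token.
function readPageNumber(items, pageSize, pageNumber) {
  let page = readPage(items, pageSize, null);
  for (let n = 1; n < pageNumber && page.continuation !== null; n++) {
    page = readPage(items, pageSize, page.continuation);
  }
  return page;
}

const items = Array.from({ length: 25 }, (_, i) => i + 1);
console.log(readPageNumber(items, 10, 3).items); // [ 21, 22, 23, 24, 25 ]
```

Fetching page 3 here costs three reads, whereas an OFFSET query would issue one request but still walk past the skipped documents.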
Another recommendation is to use indexing to improve your RU/s usage when using ORDER BY. See this link.

SELECT VALUE COUNT(1) FROM (SELECT DISTINCT c.UserId FROM root c) AS t not working

In a Cosmos DB stored procedure, I'm using an inline SQL query to try to retrieve the distinct count of a particular user id.
I'm using the SQL API for my account. I've run the below query in Query Explorer in my Cosmos DB account and I know that I should get a count of 10 (There are 10 unique user ids in my collection):
SELECT VALUE COUNT(1) FROM (SELECT DISTINCT c.UserId FROM root c) AS t
However when I run this in the Stored Procedure portal, I either get 0 records back or 18 records back (total number of documents). The code for my Stored Procedure is as follows:
function GetDistinctCount() {
var collection = getContext().getCollection();
var isAccepted = collection.queryDocuments(
collection.getSelfLink(),
'SELECT VALUE COUNT(1) FROM (SELECT DISTINCT c.UserId FROM root c) AS t',
function(err, feed, options) {
if (err) throw err;
if (!feed || !feed.length) {
var response = getContext().getResponse();
var body = {code: 404, body: "no docs found"}
response.setBody(JSON.stringify(body));
} else {
var response = getContext().getResponse();
var body = {code: 200, body: feed[0]}
response.setBody(JSON.stringify(body));
}
}
)
}
After looking at various feedback forums and documentation, I don't think there's an elegant solution for me to do this as simply as it would be in normal SQL.
the UserId is my partition key which I'm passing through in my C# code and when I test it in the portal, so there's no additional parameters that I need to set when calling the Stored Proc. I'm calling this Stored Proc via C# and adding any further parameters will have an effect on my tests for that code, so I'm keen not to introduce any parameters if I can.
Your problem is caused by not setting the partition key for your stored procedure.
Please see the statements in the official document:
And this:
So, when you execute a stored procedure in a partitioned collection, you need to pass the partition key parameter. It is required. (This case also explains it: Documentdb stored proc cross partition query)
Back to your question: you never pass a partition key, which is equivalent to passing a null or "" value for it, so the query returns no data because none of your documents has a UserId equal to null or "".
My advice:
You could use the normal query SDK to execute your SQL and set enableCrossPartitionQuery: true, which lets you scan the entire collection without specifying a partition key. Please refer to this small sample: Can't get simple CosmosDB query to work via Node.js - but works fine via Azure's Query Explorer
So I found a solution that returns the result I need. My stored procedure now looks like this:
function GetPaymentCount() {
var collection = getContext().getCollection();
var isAccepted = collection.queryDocuments(
collection.getSelfLink(),
'SELECT DISTINCT VALUE(doc.UserId) from root doc' ,
{pageSize:-1 },
function(err, feed, options) {
if (err) throw err;
if (!feed || !feed.length) {
var response = getContext().getResponse();
var body = {code: 404, body: "no docs found"}
response.setBody(JSON.stringify(body));
} else {
var response = getContext().getResponse();
var body = {code: 200, body: JSON.stringify(feed.length)}
response.setBody(JSON.stringify(body));
}
}
)
}
Essentially, I changed the pageSize parameter to -1 which returned all the documents I knew would be returned in the result. I have a feeling that this will be more expensive in terms of RU/s cost, but it solves my case for now.
If anyone has more efficient alternatives, please comment and let me know.
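One possible alternative, sketched here under assumptions rather than tested against Cosmos DB: keep a bounded page size and accumulate distinct values across pages instead of requesting everything with pageSize -1. `countDistinct` and the `pages` input are hypothetical; in a stored procedure, each page would arrive in a `queryDocuments` callback and the next one would be requested with the returned continuation.

```javascript
// Counts distinct values of `key` across any number of result pages.
function countDistinct(pages, key) {
  const seen = new Set();
  for (const page of pages) {
    for (const doc of page) {
      seen.add(doc[key]);
    }
  }
  return seen.size;
}

// 18 documents spread over 10 users, delivered in pages of 5.
const allDocs = Array.from({ length: 18 }, (_, i) => ({ UserId: `user-${i % 10}` }));
const pages = [];
for (let i = 0; i < allDocs.length; i += 5) {
  pages.push(allDocs.slice(i, i + 5));
}

console.log(countDistinct(pages, "UserId")); // 10
```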

Azure CosmosDB: stored procedure delete documents based on query

The goal is to input a simple string query like
SELECT *
FROM c
WHERE c.deviceId = "device1"
and all resulting fetched documents need to be deleted.
I have found very old posts about doing this with a stored procedure, but I can't get it to work properly with the "new" UI.
Thanks a lot in advance.
EDIT: I feel like @jay-gong pointed in the correct direction, but I encountered a problem with his solution:
I can correctly create the stored procedure but when I try to execute it it asks for the partition key, which I give but after executing, it doesn't delete any document.
The collection just has a few documents and its partition key is /message/id which is what I wrote in the partition key field.
Since Cosmos DB does not support deleting documents with SQL (Delete SQL for CosmosDB), you could query the documents and delete them one by one with the SDK's delete call. Alternatively, you could use a bulk operation in a stored procedure.
You can follow the stored procedure bulk delete sample code to implement your requirement; it works for me.
function bulkDeleteProcedure(query) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();
var response = getContext().getResponse();
var responseBody = {
deleted: 0,
continuation: true
};
query = 'SELECT * FROM c WHERE c.deviceId="device1"';
// Validate input.
if (!query) throw new Error("The query is undefined or null.");
tryQueryAndDelete();
// Recursively runs the query w/ support for continuation tokens.
// Calls tryDelete(documents) as soon as the query returns documents.
function tryQueryAndDelete(continuation) {
var requestOptions = {continuation: continuation};
var isAccepted = collection.queryDocuments(collectionLink, query, requestOptions, function (err, retrievedDocs, responseOptions) {
if (err) throw err;
if (retrievedDocs.length > 0) {
// Begin deleting documents as soon as documents are returned from the query results.
// tryDelete() resumes querying after deleting; no need to page through continuation tokens.
// - this is to prioritize writes over reads given timeout constraints.
tryDelete(retrievedDocs);
} else if (responseOptions.continuation) {
// Else if the query came back empty, but with a continuation token; repeat the query w/ the token.
tryQueryAndDelete(responseOptions.continuation);
} else {
// Else if there are no more documents and no continuation token - we are finished deleting documents.
responseBody.continuation = false;
response.setBody(responseBody);
}
});
// If we hit execution bounds - return continuation: true.
if (!isAccepted) {
response.setBody(responseBody);
}
}
// Recursively deletes documents passed in as an array argument.
// Attempts to query for more on empty array.
function tryDelete(documents) {
if (documents.length > 0) {
// Delete the first document in the array.
var isAccepted = collection.deleteDocument(documents[0]._self, {}, function (err, responseOptions) {
if (err) throw err;
responseBody.deleted++;
documents.shift();
// Delete the next document in the array.
tryDelete(documents);
});
// If we hit execution bounds - return continuation: true.
if (!isAccepted) {
response.setBody(responseBody);
}
} else {
// If the document array is empty, query for more documents.
tryQueryAndDelete();
}
}
}
Furthermore, as far as I know, stored procedures have a 5-second execution limit. If you run into a timeout error, you could pass the continuation token as a parameter to the stored procedure and execute it several times.
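The repeated execution can be driven from the client. Below is a sketch, assuming a hypothetical `executeSproc()` function that wraps the SDK call and resolves to the stored procedure's response body (`{ deleted, continuation }`, as set by the sample above, where `continuation` is a boolean rather than a token). `makeFakeSproc` merely imitates a procedure that deletes a limited batch per run.

```javascript
// Re-runs the stored procedure until it reports that nothing is left.
async function deleteUntilDone(executeSproc, maxRuns = 1000) {
  let total = 0;
  for (let run = 0; run < maxRuns; run++) {
    const body = await executeSproc();
    total += body.deleted;
    if (!body.continuation) return total; // finished
  }
  throw new Error(`Still not finished after ${maxRuns} runs`);
}

// Stand-in: deletes at most `batch` of `remaining` documents per run.
function makeFakeSproc(remaining, batch) {
  return async () => {
    const deleted = Math.min(batch, remaining);
    remaining -= deleted;
    return { deleted, continuation: remaining > 0 };
  };
}

deleteUntilDone(makeFakeSproc(250, 100))
  .then((n) => console.log(`Deleted ${n} documents`)); // Deleted 250 documents
```

The `maxRuns` guard is a design choice of this sketch so that a procedure that never finishes cannot loop forever.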
Update Answer:
A partition key is required to execute a stored procedure on a partitioned collection. (Please refer to the detailed explanation: Azure Cosmos DB asking for partition key for stored procedure.)
So, first, the above code needs your partition key. For example, if your partition key is defined as /message/id and your data looks like this:
{
"message":{
"id":"1"
}
}
Then you need to pass the pk as message/1.
Your SQL query evidently crosses partitions, so I suggest using an HTTP-triggered Azure Function instead of a stored procedure. In that function, you could use the Cosmos DB SDK to run the query and delete operations. Don't forget to set EnableCrossPartitionQuery to true. Please refer to this case: Azure Cosmos DB asking for partition key for stored procedure.

Cosmos DB - Deleting a document

How can I delete an individual record from Cosmos DB?
I can select using SQL syntax:
SELECT *
FROM collection1
WHERE (collection1._ts > 0)
And sure enough all documents (analogous to rows?) are returned
However this doesn't work when I attempt to delete
DELETE
FROM collection1
WHERE (collection1._ts > 0)
How do I achieve that?
The DocumentDB API's SQL is specifically for querying. That is, it only provides SELECT, not UPDATE or DELETE.
Those operations are fully supported, but require REST (or SDK) calls. For example, with .net, you'd call DeleteDocumentAsync() or ReplaceDocumentAsync(), and in node.js, this would be a call to deleteDocument() or replaceDocument().
In your particular scenario, you could run your SELECT to identify documents for deletion, then make "delete" calls, one per document (or, for efficiency and transactionality, pass an array of documents to delete, into a stored procedure).
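That select-then-delete flow might look like the following sketch. `queryDocs` and `deleteDoc` are hypothetical stand-ins for the SDK's query and delete calls, and the in-memory store exists only so the example runs without a Cosmos DB account.

```javascript
// Runs a query, then issues one delete call per returned document.
async function deleteMatching(queryDocs, deleteDoc, predicate) {
  const docs = await queryDocs(predicate);
  for (const doc of docs) {
    await deleteDoc(doc.id);
  }
  return docs.length;
}

// In-memory stand-in for a collection.
function makeStore(initial) {
  const byId = new Map(initial.map((d) => [d.id, d]));
  return {
    queryDocs: async (predicate) => [...byId.values()].filter(predicate),
    deleteDoc: async (id) => { byId.delete(id); },
    size: () => byId.size,
  };
}

const store = makeStore([
  { id: "a", _ts: 0 },
  { id: "b", _ts: 5 },
  { id: "c", _ts: 9 },
]);

// Mirrors the question's filter: WHERE collection1._ts > 0
deleteMatching(store.queryDocs, store.deleteDoc, (d) => d._ts > 0)
  .then((n) => console.log(`Deleted ${n}, ${store.size()} left`)); // Deleted 2, 1 left
```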
The easiest way is probably by using Azure Storage Explorer. After connecting you can drill down to a container of choice, select a document and then delete it. You can find additional tools for Cosmos DB on https://gotcosmos.com/tools.
Another option to consider is the time to live (TTL). You can turn this on for a collection and then set an expiration for the documents. The documents will be cleaned up automatically for you as they expire.
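As a rough illustration of how TTL decides which documents are eligible for cleanup, here is a sketch of the rules as I understand them from the Cosmos DB documentation: a container-level default, an optional per-document `ttl` that overrides it, and expiry counted in seconds from the last-modified timestamp `_ts`. The `isExpired` helper is hypothetical, and the exact boundary condition and the timing of the background cleanup are assumptions.

```javascript
// Decides whether a document has outlived its TTL.
// containerTtl: null (TTL off), -1 (TTL on, no default expiry) or a number of seconds.
// doc.ttl (optional): -1 (never expire) or seconds, overriding the container default.
// doc._ts: last-modified time in epoch seconds.
function isExpired(doc, containerTtl, nowSeconds) {
  if (containerTtl == null) return false; // TTL disabled on the container
  const ttl = doc.ttl !== undefined ? doc.ttl : containerTtl;
  if (ttl === -1) return false;           // explicitly never expires
  return nowSeconds >= doc._ts + ttl;
}

const now = 10000;
console.log(isExpired({ _ts: 9000 }, 3600, now));          // false
console.log(isExpired({ _ts: 5000 }, 3600, now));          // true
console.log(isExpired({ _ts: 5000, ttl: -1 }, 3600, now)); // false
console.log(isExpired({ _ts: 9990, ttl: 5 }, -1, now));    // true
console.log(isExpired({ _ts: 5000, ttl: 5 }, null, now));  // false
```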
Create a stored procedure with the following code:
/**
* A Cosmos DB stored procedure that bulk deletes documents for a given query.
* Note: You may need to execute this stored procedure multiple times (depending whether the stored procedure is able to delete every document within the execution timeout limit).
*
* @function
* @param {string} query - A query that provides the documents to be deleted (e.g. "SELECT c._self FROM c WHERE c.founded_year = 2008"). Note: For best performance, reduce the # of properties returned per document in the query to only what's required (e.g. prefer SELECT c._self over SELECT * )
* @returns {Object.<number, boolean>} Returns an object with the two properties:
* deleted - contains a count of documents deleted
* continuation - a boolean whether you should execute the stored procedure again (true if there are more documents to delete; false otherwise).
*/
function bulkDeleteStoredProcedure(query) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();
var response = getContext().getResponse();
var responseBody = {
deleted: 0,
continuation: true
};
// Validate input.
if (!query) throw new Error("The query is undefined or null.");
tryQueryAndDelete();
// Recursively runs the query w/ support for continuation tokens.
// Calls tryDelete(documents) as soon as the query returns documents.
function tryQueryAndDelete(continuation) {
var requestOptions = {continuation: continuation};
var isAccepted = collection.queryDocuments(collectionLink, query, requestOptions, function (err, retrievedDocs, responseOptions) {
if (err) throw err;
if (retrievedDocs.length > 0) {
// Begin deleting documents as soon as documents are returned from the query results.
// tryDelete() resumes querying after deleting; no need to page through continuation tokens.
// - this is to prioritize writes over reads given timeout constraints.
tryDelete(retrievedDocs);
} else if (responseOptions.continuation) {
// Else if the query came back empty, but with a continuation token; repeat the query w/ the token.
tryQueryAndDelete(responseOptions.continuation);
} else {
// Else if there are no more documents and no continuation token - we are finished deleting documents.
responseBody.continuation = false;
response.setBody(responseBody);
}
});
// If we hit execution bounds - return continuation: true.
if (!isAccepted) {
response.setBody(responseBody);
}
}
// Recursively deletes documents passed in as an array argument.
// Attempts to query for more on empty array.
function tryDelete(documents) {
if (documents.length > 0) {
// Delete the first document in the array.
var isAccepted = collection.deleteDocument(documents[0]._self, {}, function (err, responseOptions) {
if (err) throw err;
responseBody.deleted++;
documents.shift();
// Delete the next document in the array.
tryDelete(documents);
});
// If we hit execution bounds - return continuation: true.
if (!isAccepted) {
response.setBody(responseBody);
}
} else {
// If the document array is empty, query for more documents.
tryQueryAndDelete();
}
}
}
And execute it using your partition key (example: null) and a query to select the documents (example: SELECT c._self FROM c to delete all).
Based on Delete Documents from CosmosDB based on condition through Query Explorer
Here is an example of how to use bulkDeleteStoredProcedure using .net Cosmos SDK V3.
ContinuationFlag has to be used because of the execution bounds.
private async Task<int> ExecuteSpBulkDelete(string query, string partitionKey)
{
var continuationFlag = true;
var totalDeleted = 0;
while (continuationFlag)
{
StoredProcedureExecuteResponse<BulkDeleteResponse> result = await _container.Scripts.ExecuteStoredProcedureAsync<BulkDeleteResponse>(
"spBulkDelete", // your sproc name
new PartitionKey(partitionKey), // pk value
new dynamic[] { query }); // pass the method's query argument to the sproc
var response = result.Resource;
continuationFlag = response.Continuation;
var deleted = response.Deleted;
totalDeleted += deleted;
Console.WriteLine($"Deleted {deleted} documents ({totalDeleted} total, more: {continuationFlag}, used {result.RequestCharge}RUs)");
}
return totalDeleted;
}
and response model:
public class BulkDeleteResponse
{
[JsonProperty("deleted")]
public int Deleted { get; set; }
[JsonProperty("continuation")]
public bool Continuation { get; set; }
}
