Cosmos DB Pagination giving multiplied page records - azure

I have a scenario where I need to filter a collection based on the elements present in an array inside each document. Can anyone suggest how to use OFFSET and LIMIT with a nested array in a document?
{
"id": "abcd",
"pqrs": 1,
"xyz": "UNKNOWN_594",
"arrayList": [
{
"Id": 2,
"def": true
},
{
"Id": 302,
"def": true
}
]
}
Now I need to filter the collection and take the records 10 at a time. I tried the following query:
SELECT * FROM collections c
WHERE ARRAY_CONTAINS(c.arrayList , {"Id":302 },true) or ARRAY_CONTAINS(c.arrayList , {"Id":2 },true)
ORDER BY c._ts DESC
OFFSET 10 LIMIT 10
When I run this query, it returns 40 records.

With OFFSET, the RU charge keeps increasing for every subsequent page, because the skipped documents still have to be processed. You can use a continuation token instead:
private static async Task QueryWithPagingAsync(Uri collectionUri)
{
// The .NET client automatically iterates through all the pages of query results
// Developers can explicitly control paging by creating an IDocumentQueryable
// using the IQueryable object, then by reading the ResponseContinuationToken values
// and passing them back as RequestContinuationToken in FeedOptions.
List<Family> families = new List<Family>();
// tell the server we only want 1 record per page
FeedOptions options = new FeedOptions { MaxItemCount = 1, EnableCrossPartitionQuery = true };
// using AsDocumentQuery you get access to whether or not the query HasMoreResults
// If it does, just call ExecuteNextAsync until there are no more results
// No need to supply a continuation token here as the query object keeps track of progress
var query = client.CreateDocumentQuery<Family>(collectionUri, options).AsDocumentQuery();
while (query.HasMoreResults)
{
foreach (Family family in await query.ExecuteNextAsync())
{
families.Add(family);
}
}
// The above sample works fine whilst in a loop as above, but
// what if you load a page of 1 record and then in a different
// Session at a later stage want to continue from where you were?
// well, now you need to capture the continuation token
// and use it on subsequent queries
query = client.CreateDocumentQuery<Family>(
collectionUri,
new FeedOptions { MaxItemCount = 1, EnableCrossPartitionQuery = true }).AsDocumentQuery();
var feedResponse = await query.ExecuteNextAsync<Family>();
string continuation = feedResponse.ResponseContinuation;
foreach (var f in feedResponse.AsEnumerable().OrderBy(f => f.Id))
{
}
// Now the second time around use the continuation token you got
// and start the process from that point
query = client.CreateDocumentQuery<Family>(
collectionUri,
new FeedOptions
{
MaxItemCount = 1,
RequestContinuation = continuation,
EnableCrossPartitionQuery = true
}).AsDocumentQuery();
feedResponse = await query.ExecuteNextAsync<Family>();
foreach (var f in feedResponse.AsEnumerable().OrderBy(f => f.Id))
{
}
}
To step through the results one page at a time until you reach a specific page, use the code below:
private static async Task QueryPageByPage(int currentPageNumber = 1, int documentNumber = 1)
{
// Number of documents per page
const int PAGE_SIZE = 3; // configurable
// Continuation token for subsequent queries (NULL for the very first request/page)
string continuationToken = null;
do
{
Console.WriteLine($"----- PAGE {currentPageNumber} -----");
// Loads ALL documents for the current page
KeyValuePair<string, IEnumerable<Family>> currentPage = await QueryDocumentsByPage(currentPageNumber, PAGE_SIZE, continuationToken);
foreach (Family celeryTask in currentPage.Value)
{
documentNumber++;
}
// Ensure the continuation token is kept for the next page query execution
continuationToken = currentPage.Key;
currentPageNumber++;
} while (continuationToken != null);
Console.WriteLine("\n--- END: Finished Querying ALL Documents ---");
}
and QueryDocumentsByPage function as follows
private static async Task<KeyValuePair<string, IEnumerable<Family>>> QueryDocumentsByPage(int pageNumber, int pageSize, string continuationToken)
{
DocumentClient documentClient = new DocumentClient(new Uri("https://{CosmosDB/SQL Account Name}.documents.azure.com:443/"), "{CosmosDB/SQL Account Key}");
var feedOptions = new FeedOptions {
MaxItemCount = pageSize,
EnableCrossPartitionQuery = true,
// IMPORTANT: Set the continuation token (NULL for the first ever request/page)
RequestContinuation = continuationToken
};
IQueryable<Family> filter = documentClient.CreateDocumentQuery<Family>("dbs/{Database Name}/colls/{Collection Name}", feedOptions);
IDocumentQuery<Family> query = filter.AsDocumentQuery();
FeedResponse<Family> feedResponse = await query.ExecuteNextAsync<Family>();
List<Family> documents = new List<Family>();
// The query is typed as Family, so iterate as Family (not CeleryTask)
foreach (Family t in feedResponse)
{
documents.Add(t);
}
// IMPORTANT: Ensure the continuation token is kept for the next requests
return new KeyValuePair<string, IEnumerable<Family>>(feedResponse.ResponseContinuation, documents);
}

Are you actually receiving 40 elements in the results? Or is it that you are getting back 10 documents but maybe your Cosmos itself has 40 documents for this query?
With an ORDER BY clause, the database retrieves all the documents that match the query, orders them, and only then applies the OFFSET and LIMIT values to deliver the final results.
I've illustrated this in the snapshot below.
My Cosmos account has 14 documents which match the query; this is what matches the retrieved document count.
The output document count is 10 because the DB had to skip the first 5 and then deliver the next 5.
But my actual results are only 5 documents because that is what I asked for.
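The three numbers above can be reproduced with a small in-memory sketch. This is not Cosmos DB itself: `orderByOffsetLimit` and its field names are hypothetical, chosen to mirror the counts described in the answer, and OFFSET 5 LIMIT 5 is assumed from the "skip 5, deliver 5" description.

```javascript
// In-memory stand-in for ORDER BY ... OFFSET ... LIMIT; not the Cosmos DB engine.
function orderByOffsetLimit(matchingDocs, offset, limit) {
  // ORDER BY c._ts DESC: every matching document takes part in the sort.
  const sorted = [...matchingDocs].sort((a, b) => b._ts - a._ts);
  // offset + limit documents have to be produced before the skip can happen.
  const produced = sorted.slice(0, offset + limit);
  // Only the documents after the offset are handed back to the caller.
  const results = produced.slice(offset);
  return {
    retrievedDocumentCount: sorted.length,
    outputDocumentCount: produced.length,
    results,
  };
}

// 14 hypothetical documents that all match the filter.
const docs = Array.from({ length: 14 }, (_, i) => ({ id: `doc${i}`, _ts: 1000 + i }));
const page = orderByOffsetLimit(docs, 5, 5);
console.log(page.retrievedDocumentCount, page.outputDocumentCount, page.results.length);
// prints: 14 10 5
```

So a larger count in the metrics does not necessarily mean more documents came back in the result set.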
Continuation tokens are efficient for paging but have limitations. They cannot be used to jump directly to a page (say, from page 1 to page 10): you have to traverse the pages from the first one, passing each token along to get the next page. Despite this limitation, they are usually the recommended approach when a single query returns a large number of documents.
Another recommendation is to use indexing to reduce RU consumption when using ORDER BY. See this link.
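To make the "no jumping" limitation concrete, here is a rough sketch with a made-up paged source. `fetchPage` and `getPageByNumber` are hypothetical helpers, and the token format (a stringified index) is only for illustration; real Cosmos DB tokens are opaque and look nothing like this.

```javascript
// Hypothetical paged source: the token encodes where the previous page stopped.
function fetchPage(items, pageSize, continuationToken) {
  const start = continuationToken ? Number(continuationToken) : 0;
  const page = items.slice(start, start + pageSize);
  const next = start + pageSize < items.length ? String(start + pageSize) : null;
  return { page, continuationToken: next };
}

// Reaching page N means requesting pages 1..N in order and threading the token through.
function getPageByNumber(items, pageSize, pageNumber) {
  let token = null;
  let requests = 0;
  let current = [];
  for (let n = 1; n <= pageNumber; n++) {
    const response = fetchPage(items, pageSize, token);
    requests++;
    current = response.page;
    token = response.continuationToken;
    // Ran out of data before reaching the requested page.
    if (token === null && n < pageNumber) return { page: [], requests };
  }
  return { page: current, requests };
}

const items = Array.from({ length: 25 }, (_, i) => i + 1);
const third = getPageByNumber(items, 10, 3);
// third.page is [21, 22, 23, 24, 25] and third.requests is 3
console.log(third.page, third.requests);
```

Getting to page 3 costs three requests here, which is the trade-off against OFFSET/LIMIT: cheaper per page, but strictly sequential.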

Related

Query records from aws kendra index using sdk

I am testing out AWS Kendra for a business use-case and I am having trouble figuring out how to query data in the index to ensure data accuracy.
The connection where the data is coming from uses our Salesforce instance which contains over 1,000 knowledge articles.
The syncing of data appears to be working and we can see that the document count is 384.
Now, because we have over 1,000 possible articles, we have restricted our API user that is connecting Kendra to Salesforce to only be able to access specific articles.
Before we move forward, we want to ensure that the articles indexed are what we expect and have allowed the API user to bring over.
What I am now trying to do is audit / export the records that are in the index so I can compare them to the records we expect to see from the source.
For this, I am using the JavaScript SDK @aws-sdk/client-kendra.
I wrote a very basic test to try and query all of the records that had the same thing in common; _language_code.
Code Example:
const {
KendraClient,
QueryCommand
} = require("@aws-sdk/client-kendra");
const {
fromIni
} = require("@aws-sdk/credential-provider-ini");
const client = new KendraClient({
credentials: fromIni({
profile: 'ccs-account'
})
});
const fs = require('fs');
const index = "e65cacb1-5492-4760-84aa-7c6faa407455";
const pageSize = 100;
let currentPage = 1;
let totalResults;
let results = [];
/**
* Init
*/
const go = async () => {
let params = getParams(currentPage); // 1 works fine, 100 results returned. 2 returns 0 results
const command = new QueryCommand(params);
const response = await client.send(command);
totalResults = response.TotalNumberOfResults;
results = response.ResultItems;
// Write results to json
fs.writeFile('data.json', JSON.stringify(results, null, 4), (err) => {
if (err) throw err;
});
}
/**
* Get params for query
* @param {*} page
* @returns
*/
function getParams(page) {
return {
IndexId: index,
PageSize: pageSize,
PageNumber: page,
AttributeFilter: {
"EqualsTo": {
"Key": "_language_code",
"Value": {
"StringValue": "en"
}
}
},
SortingConfiguration: {
"DocumentAttributeKey": "_document_title",
"SortOrder": "ASC"
}
};
}
// Run
go();
The Problem / Question:
From what I can see in the documentation, the params seem to accept a PageNumber and PageSize which is an indication of paginated results.
When I query PageNumber=1 and PageSize=100, I get 100 records successfully as expected. Since the page size limit seems to be 100 results, my assumption is that I can change to PageNumber=2 and get the next 100 results, repeating this process until I have retrieved all the records so I can QA the data.
I am at a loss as to why 0 records are returned when I target the second page as there should certainly be 3 pages of 100 results and 1 page of 84 results.
Any thoughts on what I am missing here? Is there a simpler way to export the indexed data to perform such analysis?
Thanks!
Please refer to the API documentation: https://docs.aws.amazon.com/kendra/latest/dg/API_Query.html
Each query returns the 100 most relevant results.
So you can't get past the top 100 results by requesting a second page. If you need more results, please request a limit increase: https://docs.aws.amazon.com/kendra/latest/dg/quotas.html
Maximum number of search results per query. Default is 100. To enable more than 100 results, see Quotas Support
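As a quick sanity check on the arithmetic, the sketch below works out how many pages are reachable under that cap. `reachablePages` is a hypothetical helper, not part of the Kendra SDK, and it assumes the cap applies to the combined PageNumber × PageSize window as the answer describes.

```javascript
// Assumption: only the top `maxResults` hits per query are ever exposed
// (100 by default, per the quota quoted above).
function reachablePages(totalResults, pageSize, maxResults = 100) {
  const reachableResults = Math.min(totalResults, maxResults);
  return Math.ceil(reachableResults / pageSize);
}

console.log(reachablePages(384, 100)); // 1: page 2 would start at result 101
console.log(reachablePages(384, 25));  // 4: pages 1-4 cover results 1-100
```

With PageSize=100 the first page already uses up the whole window, which would explain why page 2 comes back empty.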

how to get one page data list and total count from database with knex.js?

I have a user table with some records (say 100). How can I get one page of data and the total count from it when there are some where conditions?
I tried the following:
var model = knex.table('users').where('status',1).where('online', 1)
var totalCount = await model.count();
var data = model.offset(0).limit(10).select()
return {
totalCount: totalCount[0]['count'],
data: data
}
but I get
{
"totalCount": "11",
"data": [
{
"count": "11"
}
]
}
How can I get the data list without writing the where conditions twice? I don't want to do it like this:
var totalCount = await knex.table('users').where('status',1).where('online', 1).count();
var data = await knex.table('users').where('status',1).where('online', 1).offset(0).limit(10).select()
return {
totalCount: totalCount[0]['count'],
data: data
}
Thank you :)
You should probably use a higher-level library like Objection.js, which already has a convenience method for getting pages and the total count.
You can do it like this with knex:
// the query builder is mutable, so calling a method changes the builder's internal state
const query = knex('users').where('status',1).where('online', 1);
// by cloning original query you can reuse common parts of the query
const total = await query.clone().count();
const data = await query.clone().offset(0).limit(10);
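To see why the original attempt returned the count inside `data`, here is a toy builder that mimics the mutable-state behaviour. `TinyQuery` is entirely hypothetical and is not knex; it only records the calls made on it.

```javascript
// Hypothetical miniature builder, NOT knex: it only mimics the mutable-state idea.
class TinyQuery {
  constructor(table, parts = []) {
    this.table = table;
    this.parts = parts;
  }
  where(column, value) {
    this.parts.push(`where ${column} = ${value}`);
    return this; // same object, mutated in place
  }
  count() {
    this.parts.push("count(*)");
    return this;
  }
  limit(n) {
    this.parts.push(`limit ${n}`);
    return this;
  }
  clone() {
    // Copy the recorded parts so the clone can diverge from the original.
    return new TinyQuery(this.table, [...this.parts]);
  }
  toString() {
    return `${this.table}: ${this.parts.join(", ")}`;
  }
}

// Without clone(): count() leaks into the later data query.
const shared = new TinyQuery("users").where("status", 1);
shared.count();
const leaked = shared.limit(10).toString();

// With clone(): each branch starts from the shared filter only.
const base = new TinyQuery("users").where("status", 1);
const totalQuery = base.clone().count().toString();
const dataQuery = base.clone().limit(10).toString();

console.log(leaked);     // users: where status = 1, count(*), limit 10
console.log(totalQuery); // users: where status = 1, count(*)
console.log(dataQuery);  // users: where status = 1, limit 10
```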

Fetching documents from DocumentDb using continuation token

I have a DocumentDb database with a collection containing 37 documents of around 150 KB each. I use the following code snippet to fetch the documents using paging:
var options = new FeedOptions
{
MaxItemCount = 100
};
var query = client.CreateDocumentQuery<Document>(collection, options).AsDocumentQuery();
while (query.HasMoreResults)
{
var result = query.ExecuteNextAsync<Document>().Result;
Console.WriteLine("Quota Usage: {0}", result.CurrentResourceQuotaUsage);
Console.WriteLine("Continuation Token: {0}", result.ResponseContinuation ?? "null");
var list = result.ToList();
Console.WriteLine("Document Count: {0}", list.Count);
}
The results I get are strange though
Quota Usage: documentSize=4;documentsSize=4386;documentsCount=37;collectionSize= 4395;
Continuation Token: null
Document Count: 36
I never get the last document: the continuation token becomes null after fetching 36 documents, even though the document count shows up as 37. I am struggling to understand this behaviour. Is there a way I can make this return all the documents?

Cosmos DB: range string index return query error

I'm trying to create a custom indexing policy on a Cosmos DB document collection that includes only one field in the index.
The index policy looks like :
new IndexingPolicy
{
Automatic = true,
IndexingMode = IndexingMode.Consistent,
ExcludedPaths = new Collection<ExcludedPath>(new[]
{
new ExcludedPath { Path = "/*" }
}),
IncludedPaths = new Collection<IncludedPath>(new[]
{
new IncludedPath
{
Path = "/Id/?",
Indexes = new Collection<Index>(new Index[] { new RangeIndex(DataType.String) {Precision = -1 } })
}
})
};
Then I do query on documents collection :
CosmosClient.CreateDocumentQuery<Entity>(
CollectionUri(docCollection),
"SELECT x from x where x.Id != \"\" ",
new FeedOptions
{
MaxItemCount = 100,
RequestContinuation = null,
EnableCrossPartitionQuery = false,
PartitionKey = new PartitionKey("entities"),
}).AsDocumentQuery();
Such a request throws an error: An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request.
An almost identical query (checking for equality instead of inequality) gives the correct result.
Did I configure the indexing policy incorrectly, or do I need to specify some additional arguments when querying? Thanks
Your partition key path should also be included in the included paths. It's implicitly included as a filter because you're setting it in PartitionKey = new PartitionKey("entities").
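As a rough illustration of what that could look like, the sketch below builds a policy object in the JSON shape Cosmos DB uses for indexing policies. `buildIndexingPolicy` is a hypothetical helper, and `/partitionKey/?` is a placeholder, since the question never shows the actual partition key path.

```javascript
// Builds an opt-in indexing policy: everything excluded, listed paths included.
// "/partitionKey/?" below is a placeholder for the real partition key path.
function buildIndexingPolicy(includedPaths) {
  return {
    indexingMode: "consistent",
    automatic: true,
    includedPaths: includedPaths.map((path) => ({ path })),
    excludedPaths: [{ path: "/*" }],
  };
}

const policy = buildIndexingPolicy(["/Id/?", "/partitionKey/?"]);
console.log(JSON.stringify(policy, null, 2));
```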

Cosmos DB - Deleting a document

How can I delete an individual record from Cosmos DB?
I can select using SQL syntax:
SELECT *
FROM collection1
WHERE (collection1._ts > 0)
And sure enough all documents (analogous to rows?) are returned
However this doesn't work when I attempt to delete
DELETE
FROM collection1
WHERE (collection1._ts > 0)
How do I achieve that?
The DocumentDB API's SQL is specifically for querying. That is, it only provides SELECT, not UPDATE or DELETE.
Those operations are fully supported, but require REST (or SDK) calls. For example, with .net, you'd call DeleteDocumentAsync() or ReplaceDocumentAsync(), and in node.js, this would be a call to deleteDocument() or replaceDocument().
In your particular scenario, you could run your SELECT to identify documents for deletion, then make "delete" calls, one per document (or, for efficiency and transactionality, pass an array of documents to delete, into a stored procedure).
The easiest way is probably by using Azure Storage Explorer. After connecting you can drill down to a container of choice, select a document and then delete it. You can find additional tools for Cosmos DB on https://gotcosmos.com/tools.
Another option to consider is the time to live (TTL). You can turn this on for a collection and then set an expiration for the documents. The documents will be cleaned up automatically for you as they expire.
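For the TTL option, the expiry rule can be sketched roughly as follows. `isExpired` is a hypothetical helper, and the rules in the comments are assumptions worth checking against the TTL documentation: expiry counted in seconds from the document's `_ts`, a per-document `ttl` overriding the container default, `-1` meaning "never expire", and nothing expiring when the container has no default set. The actual cleanup runs in the background, so the exact moment of deletion may differ.

```javascript
// Assumed TTL rules (see lead-in); timestamps are in seconds, like _ts.
function isExpired(doc, containerDefaultTtl, nowSeconds) {
  // TTL not enabled on the container: nothing expires.
  if (containerDefaultTtl === null || containerDefaultTtl === undefined) return false;
  // A per-document ttl wins over the container default.
  const ttl = doc.ttl !== undefined ? doc.ttl : containerDefaultTtl;
  if (ttl === -1) return false;
  return nowSeconds >= doc._ts + ttl;
}

console.log(isExpired({ _ts: 1000 }, 3600, 5000));          // true
console.log(isExpired({ _ts: 1000 }, 3600, 2000));          // false
console.log(isExpired({ _ts: 1000, ttl: -1 }, 3600, 5000)); // false
console.log(isExpired({ _ts: 1000, ttl: 60 }, null, 5000)); // false
```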
Create a stored procedure with the following code:
/**
* A Cosmos DB stored procedure that bulk deletes documents for a given query.
* Note: You may need to execute this stored procedure multiple times (depending whether the stored procedure is able to delete every document within the execution timeout limit).
*
* @function
* @param {string} query - A query that provides the documents to be deleted (e.g. "SELECT c._self FROM c WHERE c.founded_year = 2008"). Note: For best performance, reduce the # of properties returned per document in the query to only what's required (e.g. prefer SELECT c._self over SELECT * )
* @returns {Object.<number, boolean>} Returns an object with the two properties:
* deleted - contains a count of documents deleted
* continuation - a boolean whether you should execute the stored procedure again (true if there are more documents to delete; false otherwise).
*/
function bulkDeleteStoredProcedure(query) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();
var response = getContext().getResponse();
var responseBody = {
deleted: 0,
continuation: true
};
// Validate input.
if (!query) throw new Error("The query is undefined or null.");
tryQueryAndDelete();
// Recursively runs the query w/ support for continuation tokens.
// Calls tryDelete(documents) as soon as the query returns documents.
function tryQueryAndDelete(continuation) {
var requestOptions = {continuation: continuation};
var isAccepted = collection.queryDocuments(collectionLink, query, requestOptions, function (err, retrievedDocs, responseOptions) {
if (err) throw err;
if (retrievedDocs.length > 0) {
// Begin deleting documents as soon as documents are returned from the query results.
// tryDelete() resumes querying after deleting; no need to page through continuation tokens.
// - this is to prioritize writes over reads given timeout constraints.
tryDelete(retrievedDocs);
} else if (responseOptions.continuation) {
// Else if the query came back empty, but with a continuation token; repeat the query w/ the token.
tryQueryAndDelete(responseOptions.continuation);
} else {
// Else if there are no more documents and no continuation token - we are finished deleting documents.
responseBody.continuation = false;
response.setBody(responseBody);
}
});
// If we hit execution bounds - return continuation: true.
if (!isAccepted) {
response.setBody(responseBody);
}
}
// Recursively deletes documents passed in as an array argument.
// Attempts to query for more on empty array.
function tryDelete(documents) {
if (documents.length > 0) {
// Delete the first document in the array.
var isAccepted = collection.deleteDocument(documents[0]._self, {}, function (err, responseOptions) {
if (err) throw err;
responseBody.deleted++;
documents.shift();
// Delete the next document in the array.
tryDelete(documents);
});
// If we hit execution bounds - return continuation: true.
if (!isAccepted) {
response.setBody(responseBody);
}
} else {
// If the document array is empty, query for more documents.
tryQueryAndDelete();
}
}
}
And execute it using your partition key (example: null) and a query to select the documents (example: SELECT c._self FROM c to delete all).
Based on Delete Documents from CosmosDB based on condition through Query Explorer
Here is an example of how to use bulkDeleteStoredProcedure using .net Cosmos SDK V3.
ContinuationFlag has to be used because of the execution bounds.
private async Task<int> ExecuteSpBulkDelete(string query, string partitionKey)
{
var continuationFlag = true;
var totalDeleted = 0;
while (continuationFlag)
{
StoredProcedureExecuteResponse<BulkDeleteResponse> result = await _container.Scripts.ExecuteStoredProcedureAsync<BulkDeleteResponse>(
"spBulkDelete", // your sproc name
new PartitionKey(partitionKey), // pk value
new[] { query }); // pass the method's query parameter to the sproc
var response = result.Resource;
continuationFlag = response.Continuation;
var deleted = response.Deleted;
totalDeleted += deleted;
Console.WriteLine($"Deleted {deleted} documents ({totalDeleted} total, more: {continuationFlag}, used {result.RequestCharge}RUs)");
}
return totalDeleted;
}
and response model:
public class BulkDeleteResponse
{
[JsonProperty("deleted")]
public int Deleted { get; set; }
[JsonProperty("continuation")]
public bool Continuation { get; set; }
}
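The same "call again while continuation is true" pattern, sketched in JavaScript against an in-memory stand-in. `makeFakeBulkDelete` is hypothetical and only imitates the stored procedure's `{ deleted, continuation }` response shape; `budget` stands in for the execution bounds. It is synchronous to keep the sketch short, whereas a real SDK call would be awaited.

```javascript
// Stand-in for the stored procedure: deletes at most `budget` documents per call.
// Not the Cosmos DB SDK; assumes budget >= 1.
function makeFakeBulkDelete(store, budget) {
  return function execute(matches) {
    let deleted = 0;
    // Walk backwards so splice() does not shift indexes still to be visited.
    for (let i = store.length - 1; i >= 0 && deleted < budget; i--) {
      if (matches(store[i])) {
        store.splice(i, 1);
        deleted++;
      }
    }
    // continuation is true while matching documents remain.
    return { deleted, continuation: store.some(matches) };
  };
}

// Keep executing until the response says there is nothing left to delete.
function deleteUntilDone(execute, matches) {
  let totalDeleted = 0;
  let calls = 0;
  let more = true;
  while (more) {
    const response = execute(matches);
    calls++;
    totalDeleted += response.deleted;
    more = response.continuation;
  }
  return { totalDeleted, calls };
}

const store = Array.from({ length: 10 }, (_, i) => ({ id: i, founded_year: i < 7 ? 2008 : 2010 }));
const outcome = deleteUntilDone(makeFakeBulkDelete(store, 3), (d) => d.founded_year === 2008);
console.log(outcome.totalDeleted, outcome.calls, store.length);
// prints: 7 3 3
```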

Resources