Query records from aws kendra index using sdk - node.js

I am testing out AWS Kendra for a business use-case and I am having trouble figuring out how to query data in the index to ensure data accuracy.
The data comes in through a connector to our Salesforce instance, which contains over 1,000 knowledge articles.
The syncing of data appears to be working and we can see that the document count is 384.
Now, because we have over 1,000 possible articles, we have restricted our API user that is connecting Kendra to Salesforce to only be able to access specific articles.
Before we move forward, we want to ensure that the articles indexed are what we expect and have allowed the API user to bring over.
What I am now trying to do is audit / export the records that are in the index so I can compare them to the records we expect to see from the source.
For this, I am using the JavaScript SDK @aws-sdk/client-kendra.
I wrote a very basic test to try to query all of the records that have one thing in common: _language_code.
Code Example:
const {
    KendraClient,
    QueryCommand
} = require("@aws-sdk/client-kendra");
const {
    fromIni
} = require("@aws-sdk/credential-provider-ini");
const fs = require('fs');

const client = new KendraClient({
    credentials: fromIni({
        profile: 'ccs-account'
    })
});

const index = "e65cacb1-5492-4760-84aa-7c6faa407455";
const pageSize = 100;
let currentPage = 1;
let totalResults;
let results = [];

/**
 * Init
 */
const go = async () => {
    let params = getParams(currentPage); // 1 works fine, 100 results returned. 2 returns 0 results
    const command = new QueryCommand(params);
    const response = await client.send(command);
    totalResults = response.TotalNumberOfResults;
    results = response.ResultItems;
    // Write results to json
    fs.writeFile('data.json', JSON.stringify(results, null, 4), (err) => {
        if (err) throw err;
    });
}

/**
 * Get params for query
 * @param {*} page
 * @returns
 */
function getParams(page) {
    return {
        IndexId: index,
        PageSize: pageSize,
        PageNumber: page,
        AttributeFilter: {
            "EqualsTo": {
                "Key": "_language_code",
                "Value": {
                    "StringValue": "en"
                }
            }
        },
        SortingConfiguration: {
            "DocumentAttributeKey": "_document_title",
            "SortOrder": "ASC"
        }
    };
}

// Run
go();
The Problem / Question:
From what I can see in the documentation, the params seem to accept a PageNumber and PageSize which is an indication of paginated results.
When I query PageNumber=1 and PageSize=100, I get 100 records back as expected. Since the PageSize limit appears to be 100 results, my assumption was that I could change to PageNumber=2 and get the next 100 results, repeating this process until I have retrieved all of the records so I can QA the data.
I am at a loss as to why 0 records are returned when I target the second page, as there should certainly be 3 pages of 100 results and 1 page of 84 results.
Any thoughts on what I am missing here? Is there a simpler way to export the indexed data to perform such analysis?
Thanks!

Please refer to the API documentation: https://docs.aws.amazon.com/kendra/latest/dg/API_Query.html
Each query returns at most the 100 most relevant results, so you can't get beyond the top 100 by requesting a second page. If you need more results, please request a quota increase: https://docs.aws.amazon.com/kendra/latest/dg/quotas.html
From the quotas page: "Maximum number of search results per query. Default is 100. To enable more than 100 results, see Quotas Support."
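Given that cap, PageNumber and PageSize only move around inside the top-100 window; they are not a way to walk the whole index. A minimal sketch of collecting everything the default quota allows (index ID and attribute filter reused from the question, the smaller page size is purely for illustration):

const { KendraClient, QueryCommand } = require("@aws-sdk/client-kendra");

const client = new KendraClient({}); // default credential chain; swap in fromIni() as in the question
const MAX_RESULTS = 100; // documented default per-query cap; raise it via a quota increase

const fetchTopResults = async (indexId, pageSize = 10) => {
    const results = [];
    // Walk page by page until the top-100 window is exhausted
    for (let page = 1; (page - 1) * pageSize < MAX_RESULTS; page++) {
        const response = await client.send(new QueryCommand({
            IndexId: indexId,
            PageSize: pageSize,
            PageNumber: page,
            AttributeFilter: {
                EqualsTo: { Key: "_language_code", Value: { StringValue: "en" } }
            }
        }));
        if (!response.ResultItems || response.ResultItems.length === 0) break;
        results.push(...response.ResultItems);
    }
    return results;
};

With PageSize=100 the loop is a single request; smaller page sizes just slice the same top-100 window into more pages.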

Related

Cosmos DB Pagination giving multiplied page records

I have a scenario where I need to filter collections based on the elements present in an array inside the documents. Can anyone suggest how to use OFFSET and LIMIT with a nested array in a document?
{
    "id": "abcd",
    "pqrs": 1,
    "xyz": "UNKNOWN_594",
    "arrayList": [
        {
            "Id": 2,
            "def": true
        },
        {
            "Id": 302,
            "def": true
        }
    ]
}
Now I need to filter and fetch the records 10 at a time from the collection. I tried the following query:
SELECT * FROM collections c
WHERE ARRAY_CONTAINS(c.arrayList , {"Id":302 },true) or ARRAY_CONTAINS(c.arrayList , {"Id":2 },true)
ORDER BY c._ts DESC
OFFSET 10 LIMIT 10
Now when I run this query, it returns 40 records.
With OFFSET, the RU charge keeps growing as you step to later pages; you can use a continuation token instead:
private static async Task QueryWithPagingAsync(Uri collectionUri)
{
    // The .NET client automatically iterates through all the pages of query results
    // Developers can explicitly control paging by creating an IDocumentQueryable
    // using the IQueryable object, then by reading the ResponseContinuationToken values
    // and passing them back as RequestContinuationToken in FeedOptions.
    List<Family> families = new List<Family>();

    // tell server we only want 1 record
    FeedOptions options = new FeedOptions { MaxItemCount = 1, EnableCrossPartitionQuery = true };

    // using AsDocumentQuery you get access to whether or not the query HasMoreResults
    // If it does, just call ExecuteNextAsync until there are no more results
    // No need to supply a continuation token here as the server keeps track of progress
    var query = client.CreateDocumentQuery<Family>(collectionUri, options).AsDocumentQuery();
    while (query.HasMoreResults)
    {
        foreach (Family family in await query.ExecuteNextAsync<Family>())
        {
            families.Add(family);
        }
    }

    // The above sample works fine whilst in a loop as above, but
    // what if you load a page of 1 record and then in a different
    // session at a later stage want to continue from where you were?
    // Well, now you need to capture the continuation token
    // and use it on subsequent queries
    query = client.CreateDocumentQuery<Family>(
        collectionUri,
        new FeedOptions { MaxItemCount = 1, EnableCrossPartitionQuery = true }).AsDocumentQuery();
    var feedResponse = await query.ExecuteNextAsync<Family>();
    string continuation = feedResponse.ResponseContinuation;
    foreach (var f in feedResponse.AsEnumerable().OrderBy(f => f.Id))
    {
    }

    // Now the second time around, use the continuation token you got
    // and start the process from that point
    query = client.CreateDocumentQuery<Family>(
        collectionUri,
        new FeedOptions
        {
            MaxItemCount = 1,
            RequestContinuation = continuation,
            EnableCrossPartitionQuery = true
        }).AsDocumentQuery();
    feedResponse = await query.ExecuteNextAsync<Family>();
    foreach (var f in feedResponse.AsEnumerable().OrderBy(f => f.Id))
    {
    }
}
To page through document by document, please find the code below:
private static async Task QueryPageByPage(int currentPageNumber = 1, int documentNumber = 1)
{
    // Number of documents per page
    const int PAGE_SIZE = 3; // configurable

    // Continuation token for subsequent queries (NULL for the very first request/page)
    string continuationToken = null;
    do
    {
        Console.WriteLine($"----- PAGE {currentPageNumber} -----");

        // Loads ALL documents for the current page
        KeyValuePair<string, IEnumerable<Family>> currentPage = await QueryDocumentsByPage(currentPageNumber, PAGE_SIZE, continuationToken);
        foreach (Family family in currentPage.Value)
        {
            documentNumber++;
        }

        // Ensure the continuation token is kept for the next page query execution
        continuationToken = currentPage.Key;
        currentPageNumber++;
    } while (continuationToken != null);
    Console.WriteLine("\n--- END: Finished Querying ALL Documents ---");
}
and the QueryDocumentsByPage function is as follows:
private static async Task<KeyValuePair<string, IEnumerable<Family>>> QueryDocumentsByPage(int pageNumber, int pageSize, string continuationToken)
{
    DocumentClient documentClient = new DocumentClient(new Uri("https://{CosmosDB/SQL Account Name}.documents.azure.com:443/"), "{CosmosDB/SQL Account Key}");
    var feedOptions = new FeedOptions
    {
        MaxItemCount = pageSize,
        EnableCrossPartitionQuery = true,
        // IMPORTANT: Set the continuation token (NULL for the first ever request/page)
        RequestContinuation = continuationToken
    };
    IQueryable<Family> filter = documentClient.CreateDocumentQuery<Family>("dbs/{Database Name}/colls/{Collection Name}", feedOptions);
    IDocumentQuery<Family> query = filter.AsDocumentQuery();
    FeedResponse<Family> feedResponse = await query.ExecuteNextAsync<Family>();

    List<Family> documents = new List<Family>();
    foreach (Family family in feedResponse)
    {
        documents.Add(family);
    }

    // IMPORTANT: Ensure the continuation token is kept for the next requests
    return new KeyValuePair<string, IEnumerable<Family>>(feedResponse.ResponseContinuation, documents);
}
Are you actually receiving 40 elements in the results? Or is it that you are getting back 10 documents but maybe your Cosmos itself has 40 documents for this query?
Using an ORDER BY clause retrieves all the documents matched by the query, orders them in the database, and only then applies the OFFSET and LIMIT values to deliver the final results.
I illustrated this with a test against my own account (snapshot not reproduced here):
My Cosmos account has 14 documents which match the query; that is what the retrieved document count reflects.
The output document count is 10, because the DB had to skip the first 5 and then deliver the next 5.
But my actual results are only 5 documents, because that is what I asked for.
Continuation tokens are efficient for paging but have limitations: they cannot be used to jump straight to an arbitrary page (say, from page 1 to page 10); you have to traverse the pages from the first document, passing the token along to reach each next page. Despite that, they are usually the recommended approach when a single query matches a large number of documents.
Another recommendation is to add appropriate indexing to improve your RU/s usage when using ORDER BY.
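For completeness, the same continuation-token pattern in Node.js with the current @azure/cosmos SDK looks roughly like the sketch below; the endpoint, key, and database/container names are placeholders, not values from the question:

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({ endpoint: "https://<account>.documents.azure.com:443/", key: "<key>" });
const container = client.database("<database>").container("<collection>");

const queryPage = async (continuationToken) => {
    const iterator = container.items.query(
        'SELECT * FROM c WHERE ARRAY_CONTAINS(c.arrayList, { "Id": 302 }, true) ORDER BY c._ts DESC',
        { maxItemCount: 10, continuationToken } // undefined token = first page
    );
    // fetchNext() returns one page plus the token for the page after it
    const { resources, continuationToken: nextToken } = await iterator.fetchNext();
    return { resources, nextToken };
};

Each call is charged only for the page it actually reads, instead of re-reading everything that OFFSET would skip.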

Firestore - How to find the count of documents that can be fetched using a query without really fetching all documents? [duplicate]

This question already has answers here:
Cloud Firestore collection count
(29 answers)
Closed 2 years ago.
Actually I want to create an admin panel for my website and I need to show the total number of users that signed up today (or in the last week). I want to fetch the number of user documents that were created in the users collection today. I have a createdOn field in every user document.
Can I do something like this:
admin.firestore()
    .collection('user')
    .where('createdOn', '>', <today's midnight timestamp here>)
    .count()
Is there any hack/solution for fetching the count of documents without actually fetching them?
Since there is no COUNT() in Firestore (as yet), you are forced to query your documents to calculate or maintain the total number of records, either globally or per set of query arguments.
There is an extension to do just that but I had trouble configuring it.
I personally use a simple Cloud Function that I know is not completely bulletproof, but it is good enough to provide me with a dbutils collection where I store the latest count of any collection. I also maintain counts subcollections to store the number of records for the same collection but with query parameters, which is quite necessary to offer a REST API...
import * as functions from 'firebase-functions'
import * as admin from 'firebase-admin'

admin.initializeApp()

export const dbutils = functions
    .firestore.document('/{collection}/{id}')
    .onWrite((change, context) => {
        const db = admin.firestore()
        const docRef = `dbutils/${context.params.collection}`
        let value = 0
        if (!change.before.exists) {
            value += 1 // new record
        } else if (!change.after.exists) {
            value -= 1 // deleted record
        }
        // ignore the last case, which is an update
        if (value !== 0) {
            db.doc(docRef)
                .set(
                    {
                        count: admin.firestore.FieldValue.increment(value),
                        updatedAt: Date.now(),
                    },
                    { merge: true },
                )
                // invalidate the cached per-query counts, since they are now stale
                .then(res => deleteCollection(db, `${docRef}/counts`, 500))
                .catch(err => console.log(err))
        }
        return null
    })

const deleteCollection = (
    db: admin.firestore.Firestore,
    collectionPath: string,
    batchSize: number,
) => {
    const collectionRef = db.collection(collectionPath)
    const query = collectionRef.orderBy('__name__').limit(batchSize)
    return new Promise((resolve, reject) => {
        // deleteQueryBatch is the recursive batch-delete helper from the
        // Firebase "delete collections" documentation sample (omitted here)
        deleteQueryBatch(db, query, batchSize, resolve, reject)
    })
}
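Reading the maintained count back is then a single document read. A minimal sketch, assuming the function above is deployed and the collection being counted is named user:

const getUserCount = async () => {
    const snap = await admin.firestore().doc('dbutils/user').get()
    return snap.get('count') ?? 0 // undefined until the first tracked write
}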
I'm confident we'll get a COUNT() sooner or later!

How to query batch by batch from ElasticSearch in nodejs

I'm trying to get data from ElasticSearch in my Node application. My index contains 1 million records, so I cannot send the whole set to another service at once. That's why I want to fetch 10,000 records per request, as in this example:
const getCodesFromElasticSearch = async (batch) => {
    let startingCount = 0;
    if (batch > 1) {
        startingCount = batch * 1000; // note: (batch - 1) * 1000 would start at the first unseen record
    } else if (batch === 1) {
        startingCount = 0;
    }
    return await esClient.search({
        index: `myIndex`,
        type: 'codes',
        _source: ['column1', 'column2', 'column3'],
        body: {
            from: startingCount,
            size: 1000,
            query: {
                bool: {
                    must: [
                        ....
                    ],
                    filter: {
                        ....
                    }
                }
            },
            sort: {
                sequence: {
                    order: "asc"
                }
            }
        }
    }).then(data => data.hits.hits.map(esObject => esObject._source));
};
It still works when batch=1, but when it goes to batch=2 I get the error that from should not be larger than 10,000, as per the documentation. I don't want to raise the max_result_window setting either. Please let me know an alternative way to fetch the records 10,000 at a time.
The scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use the cursor on a traditional database.
So you can use the scroll API to fetch your whole 1M dataset, something like the code below, without using from. A normal Elasticsearch search caps from + size at 10k records per request, so asking for a greater value returns an error; that's why scrolling is a good solution for this kind of scenario.
let allRecords = [];

// first we do a search, and specify a scroll timeout
var { _scroll_id, hits } = await esclient.search({
    index: 'myIndex',
    type: 'codes',
    scroll: '30s',
    body: {
        query: {
            "match_all": {}
        },
        _source: ["column1", "column2", "column3"]
    }
})

while (hits && hits.hits.length) {
    // Append all new hits
    allRecords.push(...hits.hits)
    console.log(`${allRecords.length} of ${hits.total}`)
    var { _scroll_id, hits } = await esclient.scroll({
        scrollId: _scroll_id,
        scroll: '30s'
    })
}

console.log(`Complete: ${allRecords.length} records retrieved`)
You can also plug your own query and sort into these snippets.
As per the comment:
Step 1: Do a normal esclient.search and get back the hits and the _scroll_id. Send the hits data to your other service and keep the _scroll_id for fetching the next batch.
Step 2: Use the _scroll_id from the first batch in a while loop with esclient.scroll until you have all of your 1M records. Keep in mind that you don't need to wait for the whole 1M; inside the while loop, as soon as a response comes back, send it to your service batch by batch.
See Scroll API: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/scroll_examples.html
See Search After: https://www.elastic.co/guide/en/elasticsearch/reference/5.2/search-request-search-after.html
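If holding a scroll context open is undesirable (it pins index state for the duration of the scroll timeout), search_after is the stateless alternative the second link describes. A minimal sketch, reusing the sequence sort field from the question:

let lastSortValues; // sort values of the last hit from the previous batch

const getNextBatch = async () => {
    const response = await esClient.search({
        index: 'myIndex',
        body: {
            size: 1000,
            query: { match_all: {} },
            // sequence should be unique (or paired with a tiebreaker field) for reliable paging
            sort: [{ sequence: { order: 'asc' } }],
            // resume after the last hit of the previous batch; omitted for the first one
            ...(lastSortValues && { search_after: lastSortValues })
        }
    });
    const hits = response.hits.hits; // on client v7+ this lives under response.body
    if (hits.length) lastSortValues = hits[hits.length - 1].sort;
    return hits.map(h => h._source);
};

Unlike from/size, this is not subject to the 10,000-record window, and unlike scroll it keeps no server-side state between calls.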

How to fetch more than 100 records from azure cosmos db using query

I want to fetch more than 100 records from Azure Cosmos DB using a select query.
I am writing a stored procedure and using a select query to fetch the records.
SELECT * FROM activities a
I am getting only 100 records even though there are more than 500.
I am able to get all the records using the settings configuration provided by Azure.
I want to perform the same operation using a query or stored procedure. How can I do that?
Please suggest the changes needed to accomplish this.
I am writing a stored procedure and using a select query to fetch the records.
SELECT * FROM activities a
I am getting only 100 records even though there are more than 500.
The default value of the FeedOptions pageSize property for queryDocuments is 100, which might be the cause of the issue. Please try setting the value to -1. The following stored procedure works fine on my side; please refer to it.
function getall() {
    var context = getContext();
    var response = context.getResponse();
    var collection = context.getCollection();
    var collectionLink = collection.getSelfLink();
    var filterQuery = 'SELECT * FROM c';
    collection.queryDocuments(collectionLink, filterQuery, { pageSize: -1 },
        function (err, documents) {
            response.setBody(response.getBody() + JSON.stringify(documents));
        }
    );
}
If anyone hits this page, the answers above are obsolete.
@azure/cosmos now has some options like the ones below for those who are interested:
const usersQuery = {
    query: "SELECT * FROM c where c.userId = 'someid'" +
        " order by c.userId asc, c.timestamp asc"
};
const { resources: users } = await container.items
    .query(usersQuery, { maxDegreeOfParallelism: 5, maxItemCount: 10000 })
    .fetchNext();
For reference, see here.
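If the goal is simply every matching record in one go, the same iterator from the snippet above can also be drained in a single call. A minimal sketch against the current @azure/cosmos API, reusing the container and query text from above:

const fetchAllUsers = async () => {
    const { resources } = await container.items
        .query("SELECT * FROM c where c.userId = 'someid'")
        .fetchAll(); // follows continuation tokens internally until the query is exhausted
    return resources;
};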

How to make pagination with mongoose

I want to add a pagination feature to my collection. How can I find documents using 'start' and 'limit' positions and get the total document count in a single query?
You can't get both results in one query; the best you can do is to get them both using one Mongoose Query object:
var query = MyModel.find({});
query.count(function(err, count) {...});
query.skip(5).limit(10).exec('find', function(err, items) {...});
Use a flow control framework like async to cleanly execute them in parallel if needed.
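On modern Mongoose (5+), the same two-queries idea reads more naturally with promises; a minimal sketch, assuming countDocuments is available:

const page = 2, limit = 10;
const filter = {};
// run the count and the page fetch in parallel
Promise.all([
    MyModel.countDocuments(filter),
    MyModel.find(filter).skip((page - 1) * limit).limit(limit).exec()
]).then(([total, items]) => {
    // total: number of matching documents; items: the requested page
});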
You can use the plugin Mongoose Paginate:
$ npm install mongoose-paginate
After in your routes or controller, just add :
Model.paginate({}, { page: 3, limit: 10 }, function (err, result) {
    // result.docs
    // result.total
    // result.limit - 10
    // result.page - 3
    // result.pages
});
If you plan to have a lot of pages, you should not use skip/limit, but rather calculate ranges.
See Scott's answer for a similar question: MongoDB - paging
UPDATE:
Using skip and limit is not good for pagination. Here is the discussion about it.
@Wes Freeman gave a good answer. I read the linked post; you should use a range query: n = 0; i = n + 10; db.students.find({ "id": { $gt: n, $lt: (n + i) } });
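As a sketch of that range-based idea in Mongoose, here is a hypothetical cursor-style pager keyed on _id rather than the comment's custom id field:

// fetch one page of 10, resuming after the last _id seen on the previous page
const getPage = (lastId) =>
    MyModel.find(lastId ? { _id: { $gt: lastId } } : {})
        .sort({ _id: 1 })
        .limit(10)
        .exec();

Because _id is indexed (and ObjectIds are roughly time-ordered), each page is an index seek, so the cost stays flat instead of growing with the page number the way skip does.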
OLD ANSWER (don't use this)
Use something like:
// pass n dynamically; for page 1 it should be 0, for page 2 it should be 10, etc.
var students = db.students.find().skip(n).limit(10);
More documentation at http://www.mongodb.org/display/DOCS/Advanced+Queries
// $nin and $ne are combined under one _id key (duplicate keys in an object literal
// silently overwrite each other), and skip/limit are chained before exec so they apply
user.find({ _id: { $nin: friends_id, $ne: userId } })
    .skip(10)
    .limit(1)
    .exec(function (err, user_details) {
        if (err)
            return res.send(err);
        var response = {
            statusCode: 200,
            status: true,
            values: user_details
        };
        res.json(response);
    });
I am using the function below. You can check whether previous and next pages of data are available or not:
async (req, res) => {
    let { page = 1, size = 10 } = req.query
    page = parseInt(page)
    size = parseInt(size)
    const query = {}
    const totalData = await Model.find().estimatedDocumentCount()
    const data = await Model.find(query).skip((page - 1) * size).limit(size).exec()
    const pageNumber = Math.ceil(totalData / size)
    const results = {
        currentPage: page,
        prevPage: page <= 1 ? null : page - 1,
        nextPage: page >= pageNumber ? null : page + 1,
        data
    }
    res.json(results)
}
To learn more, see estimatedDocumentCount.
