DynamoDB client doesn't fulfil the limit - node.js

Client lib: "@aws-sdk/client-dynamodb": "3.188.0"
I have a DynamoDB pagination implementation.
My user count is 98 and the page size is 20, so I'm expecting 5 pages with 20, 20, 20, 20 and 18 users respectively.
But I'm actually getting more than 5 pages, and each page has a variable number of users, like 10, 12, 11, etc.
How can I get pages with the proper limit, like 20, 20, 20, 20 and 18?
public async pagedList(usersPerPage: number, lastEvaluatedKey?: string): Promise<PagedUser> {
  const params = {
    TableName: tableName,
    Limit: usersPerPage,
    FilterExpression: '#type = :type',
    ExpressionAttributeValues: {
      ':type': { S: type },
    },
    ExpressionAttributeNames: {
      '#type': 'type',
    },
  } as ScanCommandInput;
  if (lastEvaluatedKey) {
    params.ExclusiveStartKey = { 'oid': { S: lastEvaluatedKey } };
  }
  const command = new ScanCommand(params);
  const data = await client.send(command);
  const users: User[] = [];
  if (data.Items !== undefined) {
    data.Items.forEach((item) => {
      if (item !== undefined) {
        users.push(this.makeUser(item));
      }
    });
  }
  let lastKey;
  if (data.LastEvaluatedKey !== undefined) {
    lastKey = data.LastEvaluatedKey.oid.S?.valueOf();
  }
  return {
    users: users,
    lastEvaluatedKey: lastKey
  };
}

The Scan documentation (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.Pagination) gives a few reasons why a page may contain fewer results than the Limit:
The result set must fit in 1 MB.
If a filter is applied, the data is filtered "after scan". You have a filter in your request.
From the docs
A filter expression is applied after a Scan finishes but before the
results are returned. Therefore, a Scan consumes the same amount of
read capacity, regardless of whether a filter expression is present.
...
Now suppose that you add a filter expression to the Scan. In this case, DynamoDB applies the filter expression to the six items that were returned, discarding those that do not match. The final Scan result contains six items or fewer, depending on the number of items that were filtered.
The next section explains how you can verify that this is what is happening in your case:
Counting the items in the results
In addition to the items that match your criteria, the Scan response contains the following
elements:
ScannedCount — The number of items evaluated, before any ScanFilter is
applied. A high ScannedCount value with few, or no, Count results
indicates an inefficient Scan operation. If you did not use a filter
in the request, ScannedCount is the same as Count.
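In practice, the usual fix is to keep scanning and accumulating the filtered items until you have a full page or DynamoDB stops returning a LastEvaluatedKey. Below is a minimal sketch against the same @aws-sdk/client-dynamodb v3 client as the question; the helper name and its parameters are illustrative, not part of the original code:

import {
  DynamoDBClient,
  ScanCommand,
  ScanCommandInput,
  AttributeValue,
} from '@aws-sdk/client-dynamodb';

// Sketch: keep scanning until `usersPerPage` filtered items are collected
// or the table is exhausted.
async function scanFullPage(
  client: DynamoDBClient,
  tableName: string,
  type: string,
  usersPerPage: number,
  startKey?: string,
) {
  const items: Record<string, AttributeValue>[] = [];
  let exclusiveStartKey: Record<string, AttributeValue> | undefined =
    startKey ? { oid: { S: startKey } } : undefined;

  do {
    const params: ScanCommandInput = {
      TableName: tableName,
      Limit: usersPerPage, // evaluated *before* the filter, hence the loop
      FilterExpression: '#type = :type',
      ExpressionAttributeNames: { '#type': 'type' },
      ExpressionAttributeValues: { ':type': { S: type } },
      ExclusiveStartKey: exclusiveStartKey,
    };
    const data = await client.send(new ScanCommand(params));
    items.push(...(data.Items ?? []));
    exclusiveStartKey = data.LastEvaluatedKey;
  } while (items.length < usersPerPage && exclusiveStartKey);

  // The last scan may return a few more items than needed; if you trim here, use the
  // key of the last *returned* item (its "oid") as the next page's ExclusiveStartKey.
  return { items, lastEvaluatedKey: exclusiveStartKey };
}

Note that this does not save read capacity: as the quoted docs say, the filter is applied after the read, so the loop only hides the short pages from the caller.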

Related

How to return sequelize findAll(array) in the original order of the array?

I have a table that represents a list which can be reordered on the front end. I also have a table which stores the items in that list. So the ordering info is kept separate from the actual items. When I reorder a list say:
id   item     order
-------------------
1    item-1   1
2    item-2   2
3    item-3   3
to
id   item     order
-------------------
1    item-1   3
2    item-2   1
3    item-3   2
you can see the order column no longer lines up with the id column. To ensure consistent object building, I have a generic function which takes in an array of ids and performs a findAll() query on the table that holds the actual items, not their order. The problem I'm having is that findAll() seems to default its ordering to the order of the ids it's passed, when what I want is for the array of models to be returned in the same order as the array of ids that was fed in.
While I could do a findOne() for each id in sequence, that seems like an inefficient way to do this.
The actual function in question is this:
async buildQuestion(question) {
  // check if question is an array to allow for faster querying of multiple questions
  if (Array.isArray(question)) {
    const ids = question.filter(item => item.id).map(item => item.id);
    const questions = await db.questions.findAll({
      where: { id: ids },
      include: [db.formats, db.categories, db.types, db.questionHints, db.questionVideos]
    });
    return questions.map(question => new Promise(resolve => _build(question).then(item => resolve(item))));
  } else {
    if (!question.id) throw new Error('question.id required at Builder.buildQuestion()');
    question = await db.questions.findOne({
      where: { id: question.id },
      include: [db.formats, db.categories, db.types, db.questionHints, db.questionVideos]
    });
    return _build(question);
  }
}
AFAIK, the only way to order a SQL result set is with the ORDER BY clause, so you would have to add the current order to the database table to accomplish your goal.
If this isn't possible, you could write JS code to add the current order value to the result set, then sort. This would likely be more efficient than multiple queries, particularly if the result set is large.
Here's some sample code that works... though it may not be the cleanest code :)
let myOrder = [3, 1, 2];
User.findAll({
  attributes: ['id', 'username',
    [Sequelize.literal("0"), 'myOrder'] /* to hold desired order */
  ],
  where: { id: myOrder }
})
.then(users => {
  /* add desired sequence to result set */
  for (let ord = 0; ord < myOrder.length; ord++) {
    let usr = users.find(user => user.id == myOrder[ord]);
    if (usr) {
      usr.dataValues.myOrder = ord;
    }
  }
  /* sort in desired order */
  users.sort((a, b) => a.dataValues.myOrder - b.dataValues.myOrder);
  res.send(users);
  next();
});
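If you don't need the extra myOrder attribute in the response, a plain in-memory sort keyed on the original id array also works. A small sketch; the helper name is an assumption, and the commented usage refers to the db.questions model from the question:

// Sort any array of rows into the order given by `ids`, by mapping id -> position.
function sortByIdOrder<T extends { id: number }>(rows: T[], ids: number[]): T[] {
  const position = new Map(ids.map((id, index) => [id, index] as const));
  return [...rows].sort(
    (a, b) => (position.get(a.id) ?? 0) - (position.get(b.id) ?? 0),
  );
}

// Usage with the findAll() result from the question (db.questions is assumed):
// const rows = await db.questions.findAll({ where: { id: ids } });
// const ordered = sortByIdOrder(rows, ids);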

How to query batch by batch from ElasticSearch in nodejs

I'm trying to get data from Elasticsearch in my Node application. My index has 1 million records, so I can't send the whole set to another service at once. That's why I want to get 10,000 records per request, as in this example:
const getCodesFromElasticSearch = async (batch) => {
  let startingCount = 0;
  if (batch > 1) {
    startingCount = (batch * 1000);
  } else if (batch === 1) {
    startingCount = 0;
  }
  return await esClient.search({
    index: `myIndex`,
    type: 'codes',
    _source: ['column1', 'column2', 'column3'],
    body: {
      from: startingCount,
      size: 1000,
      query: {
        bool: {
          must: [
            ....
          ],
          filter: {
            ....
          }
        }
      },
      sort: {
        sequence: {
          order: "asc"
        }
      }
    }
  }).then(data => data.hits.hits.map(esObject => esObject._source));
}
It works when batch = 1, but from batch = 2 onwards I hit the problem that from must not be larger than 10,000, as per the documentation. I also don't want to change the max records setting. Please let me know an alternative way to fetch the data 10,000 at a time.
The scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use the cursor on a traditional database.
So you can use the scroll API to fetch your whole 1M dataset without using from, something like below. A normal Elasticsearch search can only page through the first 10k results (from + size), so using from beyond that value returns an error; that's why scrolling is a good solution for this kind of scenario.
let allRecords = [];

// first we do a search, and specify a scroll timeout
var { _scroll_id, hits } = await esclient.search({
  index: 'myIndex',
  type: 'codes',
  scroll: '30s',
  body: {
    query: {
      "match_all": {}
    },
    _source: ["column1", "column2", "column3"]
  }
})

while (hits && hits.hits.length) {
  // Append all new hits
  allRecords.push(...hits.hits)
  console.log(`${allRecords.length} of ${hits.total}`)
  var { _scroll_id, hits } = await esclient.scroll({
    scrollId: _scroll_id,
    scroll: '30s'
  })
}

console.log(`Complete: ${allRecords.length} records retrieved`)
You can also add your own query and sort to this code snippet.
As per the comment:
Step 1: Do a normal esclient.search and get the hits and the _scroll_id. Send the hits data to your other service and keep the _scroll_id for fetching the next batch.
Step 2: Use the _scroll_id from the first batch and call esclient.scroll in a while loop until you have all of your 1M records. Keep in mind that you don't need to wait for all 1M records; inside the while loop, as soon as you get a response back, send that batch to your service.
See Scroll API: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/scroll_examples.html
See Search After: https://www.elastic.co/guide/en/elasticsearch/reference/5.2/search-request-search-after.html
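If scrolling doesn't suit your setup, search_after (linked above) is the other standard way to page past the 10k window. Here is a rough sketch in the same style as the question's client; the function name, the onBatch callback, the 1,000 page size and the match_all placeholder query are illustrative assumptions, and the sort must be on a field that gives a total order:

// Sketch: page through all documents with search_after instead of scroll.
async function fetchAllWithSearchAfter(
  esClient: any,
  onBatch: (sources: any[]) => Promise<void>,
) {
  let searchAfter: any[] | undefined;

  while (true) {
    const body: any = {
      size: 1000,
      sort: [{ sequence: 'asc' }],     // must be a total order; add a tiebreaker field if needed
      _source: ['column1', 'column2', 'column3'],
      query: { match_all: {} },        // replace with your bool query + filter
    };
    if (searchAfter) {
      body.search_after = searchAfter; // sort values of the previous page's last hit
    }

    const data = await esClient.search({ index: 'myIndex', type: 'codes', body });
    const hits = data.hits.hits;
    if (!hits.length) break;           // no more pages

    await onBatch(hits.map((h: any) => h._source));
    searchAfter = hits[hits.length - 1].sort;
  }
}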

How to order by twice with MongoDB, Mongoose, and NodeJS [duplicate]

I am looking to get a random record from a huge collection (100 million records).
What is the fastest and most efficient way to do so?
The data is already there, and there is no field on which I can generate a random number to obtain a random row.
Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline operator:
// Get one random document from the mycoll collection.
db.mycoll.aggregate([{ $sample: { size: 1 } }])
If you want to select the random document(s) from a filtered subset of the collection, prepend a $match stage to the pipeline:
// Get one random document matching {a: 10} from the mycoll collection.
db.mycoll.aggregate([
{ $match: { a: 10 } },
{ $sample: { size: 1 } }
])
As noted in the comments, when size is greater than 1, there may be duplicates in the returned document sample.
Do a count of all records, generate a random number between 0 and the count, and then do:
db.yourCollection.find().limit(-1).skip(yourRandomNumber).next()
Update for MongoDB 3.2
3.2 introduced $sample to the aggregation pipeline.
There's also a good blog post on putting it into practice.
For older versions (previous answer)
This was actually a feature request: http://jira.mongodb.org/browse/SERVER-533 but it was filed under "Won't fix."
The cookbook has a very good recipe to select a random document out of a collection: http://cookbook.mongodb.org/patterns/random-attribute/
To paraphrase the recipe, you assign random numbers to your documents:
db.docs.save( { key : 1, ..., random : Math.random() } )
Then select a random document:
rand = Math.random()
result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
if ( result == null ) {
result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
}
Querying with both $gte and $lte is necessary to find the document with a random number nearest rand.
And of course you'll want to index on the random field:
db.docs.ensureIndex( { key : 1, random :1 } )
If you're already querying against an index, simply drop it, append random: 1 to it, and add it again.
You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.
First, enable geospatial indexing on a collection:
db.docs.ensureIndex( { random_point: '2d' } )
To create a bunch of documents with random points on the X-axis:
for ( i = 0; i < 10; ++i ) {
db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}
Then you can get a random document from the collection like this:
db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )
Or you can retrieve several documents nearest to a random point:
db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )
This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less-evenly distributed than the skip( random ) solution, but much faster and more fail-safe in case documents are removed.
function draw(collection, query) {
  // query: mongodb query object (optional)
  var query = query || {};
  query['random'] = { $lte: Math.random() };
  var cur = collection.find(query).sort({ random: -1 });
  if (!cur.hasNext()) {
    delete query.random;
    cur = collection.find(query).sort({ random: -1 });
  }
  var doc = cur.next();
  doc.random = Math.random();
  collection.update({ _id: doc._id }, doc);
  return doc;
}
It also requires you to add a "random" field to your documents, so don't forget to add it when you create them; you may need to initialize an existing collection as shown by Geoffrey:
function addRandom(collection) {
  collection.find().forEach(function (obj) {
    obj.random = Math.random();
    collection.save(obj);
  });
}
db.eval(addRandom, db.things);
Benchmark results
This method is much faster than the skip() method (from ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:
For a collection with 1,000,000 elements:
This method takes less than a millisecond on my machine, while the skip() method takes 180 ms on average.
The cookbook method causes large numbers of documents to never get picked, because their random number does not favor them.
This method will pick all elements evenly over time.
In my benchmark it was only 30% slower than the cookbook method.
The randomness is not 100% perfect, but it is very good (and it can be improved if necessary).
This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.
Here is a way using the default ObjectId values for _id and a little math and logic.
// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between.
// 4-bytes from a hex string is 8 characters
var min = parseInt(db.collection.find()
.sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
max = parseInt(db.collection.find()
.sort({ "_id": -1 })limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
diff = max - min;
// Get a random value from diff and divide/multiply be 1000 for The "_id" precision:
var random = Math.floor(Math.floor(Math.random(diff)*diff)/1000)*1000;
// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")
// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
.sort({ "_id": 1 }).limit(1).toArray()[0];
That's the general logic in shell representation and easily adaptable.
So in points:
Find the min and max primary key values in the collection
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.
This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value since that is what we are looking for. Using integers as the _id value is essentially simplier but the same basic idea in the points.
Now you can use the aggregation pipeline.
Example:
db.users.aggregate(
[ { $sample: { size: 3 } } ]
)
See the doc.
In Python using pymongo:
import random
def get_random_doc():
count = collection.count()
return collection.find()[random.randrange(count)]
Using Python (pymongo), the aggregate function also works.
collection.aggregate([{'$sample': {'size': sample_size }}])
This approach is a lot faster than indexing into a cursor at a random offset (e.g. collection.find()[random_int]), especially for large collections.
It is tough if there is no data to key off of. What does the _id field contain? If the ids are MongoDB ObjectIds, you could get the highest and lowest values:
lowest = db.coll.find().sort({_id:1}).limit(1).next()._id;
highest = db.coll.find().sort({_id:-1}).limit(1).next()._id;
Then, if you assume the ids are uniformly distributed (they aren't, but at least it's a start):
unsigned long long L = first_8_bytes_of(lowest)
unsigned long long H = first_8_bytes_of(highest)
V = (H - L) * random_from_0_to_1();
N = L + V;
oid = N concat random_4_bytes();
randomobj = db.coll.find({_id:{$gte:oid}}).limit(1);
You can pick a random timestamp and search for the first object that was created afterwards.
It will only scan a single document, though it doesn't necessarily give you a uniform distribution.
var randRec = function () {
  // replace with your collection
  var coll = db.collection
  // get unixtime of first and last record
  var min = coll.find().sort({_id: 1}).limit(1)[0]._id.getTimestamp() - 0;
  var max = coll.find().sort({_id: -1}).limit(1)[0]._id.getTimestamp() - 0;

  // allow to pass additional query params
  return function (query) {
    if (typeof query === 'undefined') query = {}
    var randTime = Math.round(Math.random() * (max - min)) + min;
    var hexSeconds = Math.floor(randTime / 1000).toString(16);
    var id = ObjectId(hexSeconds + "0000000000000000");
    query._id = {$gte: id}
    return coll.find(query).limit(1)
  };
}();
My solution in PHP:
/**
 * Get random docs from Mongo
 * @param $collection
 * @param $where
 * @param $fields
 * @param $limit
 * @author happy-code
 * @url happy-code.com
 */
private function _mongodb_get_random (MongoCollection $collection, $where = array(), $fields = array(), $limit = false) {
    // Total docs
    $count = $collection->find($where, $fields)->count();
    if (!$limit) {
        // Get all docs
        $limit = $count;
    }
    $data = array();
    for ($i = 0; $i < $limit; $i++) {
        // Skip documents
        $skip = rand(0, ($count - 1));
        if ($skip !== 0) {
            $doc = $collection->find($where, $fields)->skip($skip)->limit(1)->getNext();
        } else {
            $doc = $collection->find($where, $fields)->limit(1)->getNext();
        }
        if (is_array($doc)) {
            // Catch document
            $data[ $doc['_id']->{'$id'} ] = $doc;
            // Ignore current document when making the next iteration
            $where['_id']['$nin'][] = $doc['_id'];
        }
        // Every iteration, catch a document and decrease the total number of documents
        $count--;
    }
    return $data;
}
In order to get a determined number of random docs without duplicates:
first get all ids
get the number of documents
loop, getting a random index and skipping duplicates
number_of_docs = 7
db.collection('preguntas').find({}, {_id: 1}).toArray(function (err, arr) {
  count = arr.length
  idsram = []
  rans = []
  while (number_of_docs != 0) {
    var R = Math.floor(Math.random() * count);
    if (rans.indexOf(R) > -1) {
      continue
    } else {
      rans.push(R)
      idsram.push(arr[R]._id)
      number_of_docs--
    }
  }
  // fetch only the randomly selected ids
  db.collection('preguntas').find({ _id: { $in: idsram } }).toArray(function (err1, doc1) {
    if (err1) { console.log(err1); return; }
    res.send(doc1)
  });
});
The best way in Mongoose is to make an aggregation call with $sample.
However, Mongoose does not hydrate aggregation results into Mongoose documents, especially not if populate() is to be applied as well.
For getting a "lean" array from the database:
/*
Sample model should be init first
const Sample = mongoose …
*/
const samples = await Sample.aggregate([
  { $match: {} },
  { $sample: { size: 33 } },
]).exec();
console.log(samples); // a lean Array
For getting an array of mongoose documents:
const samples = (
  await Sample.aggregate([
    { $match: {} },
    { $sample: { size: 27 } },
    { $project: { _id: 1 } },
  ]).exec()
).map(v => v._id);

const mongooseSamples = await Sample.find({ _id: { $in: samples } });

console.log(mongooseSamples); // an Array of mongoose documents
I would suggest using map/reduce, where you use the map function to only emit when a random value is above a given probability.
function mapf() {
  if (Math.random() <= probability) {
    emit(1, this);
  }
}

function reducef(key, values) {
  return {"documents": values};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"probability": 0.5}});
printjson(res.results);
The reducef function above works because only one key ('1') is emitted from the map function.
The value of the "probability" is defined in the "scope", when invoking mapRreduce(...)
Using mapReduce like this should also be usable on a sharded db.
If you want to select exactly n of m documents from the db, you could do it like this:
function mapf() {
  if (countSubset == 0) return;
  var prob = countSubset / countTotal;
  if (Math.random() <= prob) {
    emit(1, {"documents": [this]});
    countSubset--;
  }
  countTotal--;
}

function reducef(key, values) {
  var newArray = new Array();
  for (var i = 0; i < values.length; i++) {
    newArray = newArray.concat(values[i].documents);
  }
  return {"documents": newArray};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"countTotal": 4, "countSubset": 2}})
printjson(res.results);
Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.
This approach might give some problems on sharded databases.
You can pick a random _id and return the corresponding object:
db.collection.count( function(err, count){
  db.collection.distinct( "_id" , function( err, result) {
    if (err)
      res.send(err)
    var randomId = result[Math.floor(Math.random() * (count - 1))]
    db.collection.findOne( { _id: randomId } , function( err, result) {
      if (err)
        res.send(err)
      console.log(result)
    })
  })
})
Here you don't need to spend space storing random numbers in the collection.
The following aggregation operation randomly selects 3 documents from the collection:
db.users.aggregate(
[ { $sample: { size: 3 } } ]
)
https://docs.mongodb.com/manual/reference/operator/aggregation/sample/
MongoDB now has $rand
To pick n non-repeating items, aggregate with { $addFields: { _f: { $rand: {} } } }, then $sort by _f and $limit n, as sketched below.
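A minimal sketch of that pipeline with the Node.js driver ($rand needs MongoDB 4.4.2 or newer; the database and collection names, the helper function and the final $project are illustrative assumptions):

import { MongoClient } from 'mongodb';

// Pick `n` distinct random documents by tagging each doc with a one-off random
// value, sorting on it, and taking the first `n`.
async function pickRandom(uri: string, n: number) {
  const client = await MongoClient.connect(uri);
  try {
    const coll = client.db('test').collection('mycoll'); // placeholder names
    return await coll.aggregate([
      { $addFields: { _f: { $rand: {} } } },
      { $sort: { _f: 1 } },
      { $limit: n },
      { $project: { _f: 0 } }, // drop the helper field from the output
    ]).toArray();
  } finally {
    await client.close();
  }
}

Unlike $sample, this sorts over every document it scans, so on a very large collection it can be slow; in exchange, the n results are guaranteed to be distinct.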
I'd suggest adding a random int field to each object. Then you can just do a
findOne({random_field: {$gte: rand()}})
to pick a random document. Just make sure you ensureIndex({random_field:1})
When I was faced with a similar problem, I backtracked and found that the business request was actually for creating some form of rotation of the inventory being presented. In that case, there are much better options, which have answers from search engines like Solr, not data stores like MongoDB.
In short, with the requirement to "intelligently rotate" content, what we should do instead of a random number across all of the documents is to include a personal q-score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that has the productId, impression count, click-through count, last-seen date, and whatever other factors the business finds meaningful for computing a q-score modifier. When retrieving the set to display, you typically request more documents from the data store than the end user asked for, apply the q-score modifier, take the number of records requested by the end user, and then randomize that page of results; since it is a tiny set, simply sort the documents in the application layer (in memory).
If the universe of users is too large, you can categorize users into behavior groups and index by behavior group rather than user.
If the universe of products is small enough, you can create an index per user.
I have found this technique to be much more efficient, but more importantly more effective in creating a relevant, worthwhile experience of using the software solution.
None of the solutions worked well for me, especially when there are many gaps and the set is small.
This worked very well for me (in PHP):
$count = $collection->count($search);
$skip = mt_rand(0, $count - 1);
$result = $collection->find($search)->skip($skip)->limit(1)->getNext();
My PHP/MongoDB sort/order by RANDOM solution. Hope this helps anyone.
Note: I have numeric ID's within my MongoDB collection that refer to a MySQL database record.
First I create an array with 10 randomly generated numbers
$randomNumbers = [];
for($i = 0; $i < 10; $i++){
$randomNumbers[] = rand(0,1000);
}
In my aggregation I use the $addFields pipeline stage combined with $arrayElemAt and $mod (modulus). The modulus operator gives me a number from 0 to 9, which I then use to pick a number from the array of randomly generated numbers.
$aggregate[] = [
'$addFields' => [
'random_sort' => [ '$arrayElemAt' => [ $randomNumbers, [ '$mod' => [ '$my_numeric_mysql_id', 10 ] ] ] ],
],
];
After that you can use the sort Pipeline.
$aggregate[] = [
'$sort' => [
'random_sort' => 1
]
];
My simplest solution to this ...
db.coll.find()
.limit(1)
.skip(Math.floor(Math.random() * 500))
.next()
where your collection has at least 500 items.
If you have a simple id key, you could store all the id's in an array, and then pick a random id. (Ruby answer):
ids = @coll.find({}, fields: {_id: 1}).to_a
@coll.find(ids.sample).first
Using Map/Reduce, you can certainly get a random record, just not necessarily very efficiently depending on the size of the resulting filtered collection you end up working with.
I've tested this method with 50,000 documents (the filter reduces it to about 30,000), and it executes in approximately 400ms on an Intel i3 with 16GB ram and a SATA3 HDD...
db.toc_content.mapReduce(
  /* map function */
  function() { emit( 1, this._id ); },
  /* reduce function */
  function(k, v) {
    var r = Math.floor((Math.random() * v.length));
    return v[r];
  },
  /* options */
  {
    out: { inline: 1 },
    /* Filter the collection to "A"ctive documents */
    query: { status: "A" }
  }
);
The Map function simply creates an array of the id's of all documents that match the query. In my case I tested this with approximately 30,000 out of the 50,000 possible documents.
The Reduce function simply picks a random integer between 0 and the number of items (-1) in the array, and then returns that _id from the array.
400 ms sounds like a long time, and it really is; if you had fifty million records instead of fifty thousand, the overhead may increase to the point where it becomes unusable in multi-user situations.
There is an open issue for MongoDB to include this feature in the core... https://jira.mongodb.org/browse/SERVER-533
If this "random" selection was built into an index-lookup instead of collecting ids into an array and then selecting one, this would help incredibly. (go vote it up!)
This works nicely; it's fast, works with multiple documents, and doesn't require pre-populating the rand field, which will eventually populate itself:
Add an index to the .rand field on your collection.
Use find and refresh, something like:
// Install packages:
//   npm install mongodb async
// Add index in mongo:
//   db.ensureIndex('mycollection', { rand: 1 })

var mongodb = require('mongodb')
var async = require('async')

// Find n random documents by using "rand" field.
function findAndRefreshRand (collection, n, fields, done) {
  var result = []
  var rand = Math.random()

  // Append documents to the result based on criteria and options, if options.limit is 0 skip the call.
  var appender = function (criteria, options, done) {
    return function (done) {
      if (options.limit > 0) {
        collection.find(criteria, fields, options).toArray(
          function (err, docs) {
            if (!err && Array.isArray(docs)) {
              Array.prototype.push.apply(result, docs)
            }
            done(err)
          }
        )
      } else {
        async.nextTick(done)
      }
    }
  }

  async.series([

    // Fetch docs with uninitialized .rand.
    // NOTE: You can comment out this step if all docs have initialized .rand = Math.random()
    appender({ rand: { $exists: false } }, { limit: n - result.length }),

    // Fetch on one side of random number.
    appender({ rand: { $gte: rand } }, { sort: { rand: 1 }, limit: n - result.length }),

    // Continue fetch on the other side.
    appender({ rand: { $lt: rand } }, { sort: { rand: -1 }, limit: n - result.length }),

    // Refresh fetched docs, if any.
    function (done) {
      if (result.length > 0) {
        var batch = collection.initializeUnorderedBulkOp({ w: 0 })
        for (var i = 0; i < result.length; ++i) {
          batch.find({ _id: result[i]._id }).updateOne({ rand: Math.random() })
        }
        batch.execute(done)
      } else {
        async.nextTick(done)
      }
    }

  ], function (err) {
    done(err, result)
  })
}

// Example usage
mongodb.MongoClient.connect('mongodb://localhost:27017/core-development', function (err, db) {
  if (!err) {
    findAndRefreshRand(db.collection('profiles'), 1024, { _id: true, rand: true }, function (err, result) {
      if (!err) {
        console.log(result)
      } else {
        console.error(err)
      }
      db.close()
    })
  } else {
    console.error(err)
  }
})
P.S. The question "How to find random records in mongodb" is marked as a duplicate of this question. The difference is that this question asks explicitly about a single record, while the other one asks explicitly about getting multiple random documents.
For me, I wanted to get the same records in a random order, so I created an empty sort array, then generated a random number between one and seven (I have seven fields). Each time I get a different value, I assign a different sort.
It is 'layman' but it worked for me.
// generate a random number
const randomval = some random value;
// declare the sort array and initialize it to empty
const sort = [];
// write a conditional if/else to decide which sort to use
if (randomval == 1) {
  sort.push(...['createdAt', 1]);
}
else if (randomval == 2) {
  sort.push(...['_id', 1]);
}
....
else if (randomval == n) {
  sort.push(...['n', 1]);
}
If you're using mongoid, the document-to-object wrapper, you can do the following in
Ruby. (Assuming your model is User)
User.all.to_a[rand(User.count)]
In my .irbrc, I have
def rando klass
klass.all.to_a[rand(klass.count)]
end
so in rails console, I can do, for example,
rando User
rando Article
to get documents randomly from any collection.
You can also use shuffle-array after executing your query:
var shuffle = require('shuffle-array');
Accounts.find(qry, function (err, results_array) {
  newIndexArr = shuffle(results_array);
});
What works efficiently and reliably is this:
Add a field called "random" to each document and assign a random value to it, add an index for the random field and proceed as follows:
Let's assume we have a collection of web links called "links" and we want a random link from it:
link = db.links.find().sort({random: 1}).limit(1)[0]
To ensure the same link won't pop up a second time, update its random field with a new random number:
db.links.update({ _id: link._id }, { $set: { random: Math.random() } })

How can I do DynamoDB limit after filtering?

I would like to implement a DynamoDB Scan with the following logic:
Scanning -> Filtering(boolean true or false) -> Limiting(for pagination)
However, I have only been able to implement a Scan with this logic:
Scanning -> Limiting(for pagination) -> Filtering(boolean true or false)
How can I achieve this?
Below is an example I have written that implements the second Scan logic:
var parameters = {
  TableName: this.tableName,
  Limit: queryStatement.limit
};

if ('role' in queryStatement) {
  parameters.FilterExpression = '#role = :role';
  parameters.ExpressionAttributeNames = {
    '#role': 'role'
  };
  parameters.ExpressionAttributeValues = {
    ':role': queryStatement.role
  };
}

if ('startKey' in queryStatement) {
  parameters.ExclusiveStartKey = { id: queryStatement.startKey };
}

this.documentClient.scan(parameters, (errorResult, result) => {
  if (errorResult) {
    errorResult._status = 500;
    return reject(errorResult);
  }
  return resolve(result);
});
This code works like the second one:
Scanning -> Limiting -> Filtering
DynamoDB's Limit works as described below (i.e. the second approach in your post) by design, so there is no way to change this behavior.
LastEvaluatedKey should be used to get the data on subsequent scans.
Scanning -> Limiting(for pagination) -> Filtering(boolean true or false)
In a request, set the Limit parameter to the number of items that you
want DynamoDB to process before returning results.
In a response, DynamoDB returns all the matching results within the
scope of the Limit value. For example, if you issue a Query or a Scan
request with a Limit value of 6 and without a filter expression,
DynamoDB returns the first six items in the table that match the
specified key conditions in the request (or just the first six items
in the case of a Scan with no filter). If you also supply a
FilterExpression value, DynamoDB will return the items in the first
six that also match the filter requirements (the number of results
returned will be less than or equal to 6).
For either a Query or Scan operation, DynamoDB might return a
LastEvaluatedKey value if the operation did not return all matching
items in the table. To get the full count of items that match, take
the LastEvaluatedKey value from the previous request and use it as the
ExclusiveStartKey value in the next request. Repeat this until
DynamoDB no longer returns a LastEvaluatedKey value.
Use --max-items=2 instead of --limit=2; --max-items applies the limit after filtering, since it is handled client-side by the CLI.
Sample query with max-items:
aws dynamodb query --table-name=limitTest --key-condition-expression="gsikey=:hash AND gsisort>=:sort" --expression-attribute-values '{ ":hash":{"S":"1"}, ":sort":{"S":"1"}, ":levels":{"N":"10"}}' --filter-expression="levels >= :levels" --scan-index-forward --max-items=2 --projection-expression "levels,key1" --index-name=gsikey-gsisort-index
Sample query with limit:
aws dynamodb query --table-name=limitTest --key-condition-expression="gsikey=:hash AND gsisort>=:sort" --expression-attribute-values '{ ":hash":{"S":"1"}, ":sort":{"S":"1"}, ":levels":{"N":"10"}}' --filter-expression="levels >= :levels" --scan-index-forward --limit=2 --projection-expression "levels,key1" --index-name=gsikey-gsisort-index
If there is just one field of interest for the filtering, you could create an index with that field as a key; the condition then becomes a key condition, and you do not need to keep recursing to fill up the limit (see the sketch below).
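A rough sketch of that idea, reusing the role filter from the question and a hypothetical GSI named role-index whose partition key is role; because the condition is now a key condition rather than a filter, Limit counts only matching items:

import * as AWS from 'aws-sdk';

const documentClient = new AWS.DynamoDB.DocumentClient();

// Hypothetical GSI "role-index" with partition key "role": Limit now applies to
// items that already match the condition, so one request can fill a whole page.
function pageByRole(tableName: string, role: string, limit: number, startKey?: AWS.DynamoDB.DocumentClient.Key) {
  const params: AWS.DynamoDB.DocumentClient.QueryInput = {
    TableName: tableName,
    IndexName: 'role-index',                       // assumed GSI
    KeyConditionExpression: '#role = :role',
    ExpressionAttributeNames: { '#role': 'role' },
    ExpressionAttributeValues: { ':role': role },
    Limit: limit,
    ExclusiveStartKey: startKey,                   // previous LastEvaluatedKey, if paging
  };
  // result.Items holds up to `limit` matching items; result.LastEvaluatedKey pages further.
  return documentClient.query(params).promise();
}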
Limits are supported on DynamoDB queries:
var params = {
  TableName: "message",
  IndexName: "thread_id-timestamp-index",
  KeyConditionExpression: "#mid = :mid",
  ExpressionAttributeNames: {
    "#mid": "thread_id"
  },
  ExpressionAttributeValues: {
    ":mid": payload.thread_id
  },
  Limit: 3,
  // ExclusiveStartKey: <LastEvaluatedKey from the previous response>, for paging
  ScanIndexForward: false
};

req.dynamo.query(params, function (err, data) {
  console.log(err, data);
})
Limit defines the number of items/records evaluated using only the KeyCondition (if present), before applying any filters. To solve this problem, as pointed out in earlier answers, one approach is to use a GSI where the filtering condition is part of your GSI key. However, this is fairly restrictive, as it's not practical to introduce a new GSI for every access pattern that requires pagination. A more realistic approach is to track the Count in the query response and keep querying and appending the next page of results until the aggregated Count satisfies the client-defined Limit. Keep in mind that you will need some custom logic to build the LastEvaluatedKey, which is required to fetch the subsequent result page. In Go, this can be achieved in the following way.
func PaginateWithFilters(ctx context.Context, keyCondition string, filteringCondition int, cursor *Cursor) ([]*Records, error) {
	var collectiveResult []map[string]types.AttributeValue
	var records []*Records

	expr, err := buildFilterQueryExpression(keyCondition, filteringCondition)
	if err != nil {
		return nil, err
	}

	queryInput := &dynamodb.QueryInput{
		ExpressionAttributeNames:  expr.Names(),
		ExpressionAttributeValues: expr.Values(),
		KeyConditionExpression:    expr.KeyCondition(),
		FilterExpression:          expr.Filter(),
		TableName:                 aws.String(yourTableName),
		Limit:                     aws.Int32(cursor.PageLimit),
	}
	if cursor.LastEvaluatedKey != nil {
		queryInput.ExclusiveStartKey = cursor.LastEvaluatedKey
	}

	paginator := dynamodb.NewQueryPaginator(dbClient, queryInput)
	for {
		if !paginator.HasMorePages() {
			fmt.Println("no more records in the partition")
			cursor.LastEvaluatedKey = nil
			break
		}
		singlePage, err := paginator.NextPage(ctx)
		if err != nil {
			return nil, err
		}
		pendingItems := int(cursor.PageLimit) - len(collectiveResult)
		if int(singlePage.Count) >= pendingItems {
			collectiveResult = append(collectiveResult, singlePage.Items[:pendingItems]...)
			cursor.LastEvaluatedKey = buildExclusiveStartKey(singlePage.Items[pendingItems-1])
			break
		}
		collectiveResult = append(collectiveResult, singlePage.Items...)
	}

	err = attributevalue.UnmarshalListOfMaps(collectiveResult, &records)
	if err != nil {
		return nil, err
	}
	return records, nil
}
This Medium article discusses pagination with DynamoDB in a bit more depth and includes code snippets in Go to paginate query responses with filters.

AWS DynamoDB Node.js scan- certain number of results

I'm trying to get the first 10 items that satisfy a condition from DynamoDB using an AWS Lambda. I was trying to use the Limit parameter, but according to the documentation at
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB.html#scan-property
it is the "maximum number of items to evaluate (not necessarily the number of matching items)".
How do I get the first 10 items that satisfy my condition?
var AWS = require('aws-sdk');
var db = new AWS.DynamoDB();

exports.handler = function(event, context) {
  var params = {
    TableName: "Events", //"StreamsLambdaTable",
    ProjectionExpression: "ID, description, endDate, imagePath, locationLat, locationLon, #nm, startDate, #tp, userLimit", //specifies the attributes you want in the scan result.
    FilterExpression: "locationLon between :lower_lon and :higher_lon and locationLat between :lower_lat and :higher_lat",
    ExpressionAttributeNames: {
      "#nm": "name",
      "#tp": "type",
    },
    ExpressionAttributeValues: {
      ":lower_lon": {"N": event.low_lon},
      ":higher_lon": {"N": event.high_lon}, //event.high_lon}
      ":lower_lat": {"N": event.low_lat},
      ":higher_lat": {"N": event.high_lat}
    }
  };

  db.scan(params, function(err, data) {
    if (err) {
      console.log(err); // an error occurred
    }
    else {
      data.Items.forEach(function(record) {
        console.log(record.name.S + "");
      });
      context.succeed(data.Items);
    }
  });
};
I think you already know the reason behind this: the distinction that DynamoDB makes between ScannedCount and Count. As per this,
ScannedCount — the number of items that were queried or scanned, before any filter expression was applied to the results.
Count — the number of items that were returned in the response.
The fix for that is documented right above this:
For either a Query or Scan operation, DynamoDB might return a LastEvaluatedKey value if the operation did not return all matching items in the table. To get the full count of items that match, take the LastEvaluatedKey value from the previous request and use it as the ExclusiveStartKey value in the next request. Repeat this until DynamoDB no longer returns a LastEvaluatedKey value.
So, the answer to your question is: use the LastEvaluatedKey from the DynamoDB response and Scan again, repeating until you have collected 10 matching items (see the sketch below).
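A minimal sketch of that loop with the v2 SDK used in the question; the helper name and the hard-coded page size of 10 are illustrative:

import * as AWS from 'aws-sdk';

// Sketch: keep scanning until we have 10 matching items or the table is exhausted.
// `params` is the scan input from the question.
async function firstTenMatches(db: AWS.DynamoDB, params: AWS.DynamoDB.ScanInput) {
  const matches: AWS.DynamoDB.ItemList = [];
  let lastKey: AWS.DynamoDB.Key | undefined;

  do {
    const page = await db.scan({ ...params, ExclusiveStartKey: lastKey }).promise();
    matches.push(...(page.Items ?? []));
    lastKey = page.LastEvaluatedKey;
  } while (matches.length < 10 && lastKey);

  return matches.slice(0, 10); // trim any surplus from the last page
}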
