I am trying to get the top 100 documents from my DB based on the sum of a few fields.
The data is similar to this:
{
"userID": "227837830704005140",
"shards": {
"ancient": {
"pulled": 410
},
"void": {
"pulled": 1671
},
"sacred": {
"pulled": 719
}
}
}
I want to sum the "pulled" number for the 3 types of shard, and then use that to determine the top 100.
I tried this in Node.js:
let top100: IShardData[] = await collection.find<IShardData>({}, {
    projection: {
        "shards.ancient.pulled": 1,
        "shards.void.pulled": 1,
        "shards.sacred.pulled": 1,
        orderBySumValue: { $add: ["$shards.ancient.pulled", "$shards.void.pulled", "$shards.sacred.pulled"] }
    },
    sort: { orderBySumValue: 1 },
    limit: 100
}).toArray()
This connects to the DB, gets the right collection, and seems to sum the values correctly, but it does not sort them by that sum to get the top 100. I used this as a basis for my code: https://www.tutorialspoint.com/mongodb-order-by-two-fields-sum
Not sure what I need to do to make it work. Any help is appreciated.
Thanks in advance!
Here's one way to do it using an aggregation pipeline.
db.collection.aggregate([
{
// create new field for the sum
"$set": {
"pulledSum": {
"$sum": [
"$shards.ancient.pulled",
"$shards.void.pulled",
"$shards.sacred.pulled"
]
}
}
},
{
// sort on the sum
"$sort": {
"pulledSum": -1
}
},
{
// limit to the desired number (100 in your case)
"$limit": 100
},
{
// don't return some fields
"$unset": [
"_id",
"pulledSum"
]
}
])
Try it on mongoplayground.net.
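If you want to stay in the Node.js driver rather than the shell, the same pipeline can be run with aggregate() instead of find(). A minimal sketch, assuming collection is the same collection reference from your question and your server is MongoDB 4.2+ (for $set/$unset):
let top100: IShardData[] = await collection.aggregate<IShardData>([
    // compute the sum of the three "pulled" counters
    { $set: { pulledSum: { $sum: ["$shards.ancient.pulled", "$shards.void.pulled", "$shards.sacred.pulled"] } } },
    // highest sums first
    { $sort: { pulledSum: -1 } },
    { $limit: 100 },
    // drop the helper field from the output
    { $unset: ["pulledSum"] }
]).toArray()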
sort should be written with a dollar sign.
$sort: { orderBySumValue: 1 }
Related
I have a collection of workorders where each document has time_started and time_completed values. I want an auto-calculated field called duration that is computed as time_completed - time_started. What is the best way to do this?
Essentially what I want is that when the app sends a POST request with a completed time, the duration is calculated automatically.
My current update route:
router.post('/completed', function (req, res) {
const time_completed = req.body.time_completed
const workorder_id = req.body.workorder_id
db.collection(workorder).updateOne(
{ _id: ObjectId(workorder_id) },
{
$set: {
time_completed: time_completed,
}
},
function (err, result) {
if (err) throw err
res.send('Updated')
}
)
});
Query
A pipeline update requires MongoDB >= 4.2.
It sets the time_completed and adds the calculated duration in the same update.
*Replace the 6 with the JavaScript variable that holds the time_completed Date.
*The duration will be in milliseconds.
Test code here
db.collection.update(
{"_id": 1},
[
{
"$set": {
"time_completed": 6,
"duration": {
"$subtract": [
6,
"$time_started"
]
}
}
}
])
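Applied to the route handler from the question, a rough sketch might look like this (it assumes the Node driver and server both support pipeline updates, i.e. MongoDB >= 4.2, that req.body.time_completed is a parseable date value, and that time_started is already stored as a Date; see the edit below if it is a string):
router.post('/completed', function (req, res) {
    const time_completed = new Date(req.body.time_completed) // assumption: parseable date value
    const workorder_id = req.body.workorder_id
    db.collection(workorder).updateOne(
        { _id: ObjectId(workorder_id) },
        [
            {
                $set: {
                    time_completed: time_completed,
                    // milliseconds between start and completion
                    duration: { $subtract: [time_completed, "$time_started"] }
                }
            }
        ],
        function (err, result) {
            if (err) throw err
            res.send('Updated')
        }
    )
});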
Edit
You have strings in your database; I thought they were dates. The best thing to do is to convert all those string dates to Date with $dateFromString like the code below, and then use the first query.
If you need the string back from a Date, you can use $dateToString when you need it.
Query
Same as above, but it converts the string dates to Date to do the subtraction (it keeps the dates as strings inside the database).
Test code here
db.collection.update({
"_id": 1
},
[
{
"$set": {
"time_completed": "2021-11-21T00:00:00.000Z",
"duration": {
"$subtract": [
ISODate("2021-11-21T00:00:00Z"),
{
"$dateFromString": {
"dateString": "$time_started"
}
}
]
}
}
}
])
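If you do decide to convert the stored strings once and for all, a one-off migration along these lines is one option (a sketch only; it assumes the collection is named workorders, every time_started is a parseable string, and MongoDB >= 4.2):
// run once, e.g. in the mongo shell
db.workorders.updateMany(
    { time_started: { $type: "string" } },
    [
        { $set: { time_started: { $dateFromString: { dateString: "$time_started" } } } }
    ]
)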
Provided I have the following documents:
User
{
uuid: string,
isActive: boolean,
lastLogin: datetime,
createdOn: datetime
}
Projects
{
id: string,
users: [
{
uuid: string,
otherInfo: ...
},
{... more users}
]
}
And I want to select all users that haven't logged in for 2 weeks and are inactive, or that haven't logged in for 5 weeks and don't have projects.
Now, the 2-week part is working fine, but I cannot seem to figure out how to do the "5 weeks and don't have projects" part.
I came up with something like the code below, but the last part does not work because $exists obviously is not a top-level operator.
Has anyone ever done anything like this?
Thanks!
return await this.collection
.aggregate([
{
$match: {
$and: [
{
$expr: {
$allElementsTrue: {
$map: {
input: [`$lastLogin`, `$createdOn`],
in: { $lt: [`$$this`, twoWeeksAgo] }
}
}
}
},
{
$or: [
{
isActive: false
},
{
$and: [
{
$expr: {
$allElementsTrue: {
$map: {
input: [`$lastLogin`, `$createdOn`],
in: { $lt: [`$$this`, fiveWeeksAgo] }
}
}
}
},
{
//No projects exists on this user
$exists: {
$lookup: {
from: _.get(Config, `env.collection.projects`),
let: {
currentUser: `$$ROOT`
},
pipeline: [
{
$project: {
_id: 0,
users: {
$filter: {
input: `$users`,
as: `user`,
cond: {
$eq: [`$$user.uuid`, `$currentUser.uuid`]
}
}
}
}
}
]
}
}
}
]
}
]
}
]
}
}
])
.toArray();
Not certain why you thought $expr was needed in the initial $match, but really:
const getResults = () => {
const now = Date.now();
const twoWeeksAgo = new Date(now - (1000 * 60 * 60 * 24 * 7 * 2 ));
const fiveWeeksAgo = new Date(now - (1000 * 60 * 60 * 24 * 7 * 5 ));
// as long a mongoDriverCollectionReference points to a "Collection" object
// for the "users" collection
return mongoDriverCollectionReference.aggregate([
// No $expr, since you can actually use an index. $expr cannot do that
{ "$match": {
"$or": [
// Active and "logged in"/created in the last 2 weeks
{
"isActive": true,
"$or": [
{ "lastLogin": { "$gte": twoWeeksAgo } },
{ "createdOn": { "$gte": twoWeeksAgo } }
]
},
// Also want those who...
// Not Active and "logged in"/created in the last 5 weeks
// we'll "tag" them later
{
"isActive": false,
"$or": [
{ "lastLogin": { "$gte": fiveWeeksAgo } },
{ "createdOn": { "$gte": fiveWeeksAgo } }
]
}
]
}},
// Now we do the "conditional" stuff, just to return a matching result or not
{ "$lookup": {
"from": _.get(Config, `env.collection.projects`), // there are a lot cleaner ways to register models than this
"let": {
"uuid": {
"$cond": {
"if": "$isActive", // this is boolean afterall
"then": null, // don't really want to match
"else": "$uuid" // Okay to match the 5 week results
}
}
},
"pipeline": [
// Nothing complex here as null will match nothing. Use $in (inside $expr) against the array
{ "$match": { "$expr": { "$in": [ "$$uuid", "$users.uuid" ] } } },
// Don't really need the detail, so just reduce any matches to one result of [null]
{ "$group": { "_id": null } }
],
"as": "projects"
}},
// Now test if the $lookup returned something where it mattered
{ "$match": {
"$or": [
{ "active": true }, // remember we selected the active ones already
{
"projects.0": { "$exists": false } // So now we only need to know the "inactive" returned no array result.
}
]
}}
]).toArray(); // returns a Promise
};
It's pretty simple: calculated expressions via $expr are actually really bad and not what you want in a first pipeline stage. They are also not what you need here, since createdOn and lastLogin really should not have been merged into an array for $allElementsTrue, which would just be an AND condition, where the logic you described really means OR. So the $or does just fine here.
So does the $or on the separation of conditions for the isActive of true/false. Again it's either "two weeks" OR "five weeks". And this certainly does not need $expr since standard inequality range matching works fine, and uses an "index".
Then you really just want to do the "conditional" things in the let for $lookup instead of your "does it exist" thinking. All you really need to know ( since the range selection of dates is actually already done ) is whether active is now true or false. Where it's active ( meaning by your logic you don't care about projects ) simply make the $$uuid used within the $match pipeline stage a null value so it will not match and the $lookup returns an empty array. Where false ( also already matching the date conditions from earlier ) then you use the actual value and "join" ( where there are projects of course ).
Then it's just a simple matter of keeping the active users, and then only testing the remaining false values for active to see if the "projects" array from the $lookup actually returned anything. If it did not, then they just don't have projects.
I should probably note here that since users is an "array" within the projects collection, you use $in (inside $expr) for the $match condition, comparing the single value against the array.
Note that for brevity we can use $group inside the inner pipeline to only return one result instead of possibly many matches to actual matched projects. You don't care about the content or the "count", but simply if one was returned or nothing. Again following the presented logic.
This gets you your desired results, and it does so in a manner that is efficient and actually uses indexes where available.
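As a rough illustration of the index point (the collection name, fields and sort order here are assumptions; verify against explain() on your real data):
// hypothetical indexes to support the initial $match; each $or branch can use its own index
db.users.createIndex({ isActive: 1, lastLogin: -1 })
db.users.createIndex({ isActive: 1, createdOn: -1 })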
Also, return await certainly does not do what you think it does; in fact it triggers an ESLint warning (I suggest you enable ESLint in your project) since it's not a smart thing to do. It does nothing really, as you would need to await getResults() (as per the example naming) anyway; the await keyword is not "magic" but just a prettier way of writing then(), as well as hopefully being easier to understand once you grasp what async/await is really for syntactically.
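In other words, at the call site (inside some async function, using the example naming above):
// getResults() already returns a Promise from the driver, so just await it where you need the data
const users = await getResults();
console.log(users.length);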
I have a Comments collection in Mongoose, and a query that returns the most recent five (an arbitrary number) Comments.
Every Comment is associated with another document. What I would like to do is make a query that returns the most recent 5 comments, with comments associated with the same other document combined.
So instead of a list like this:
results = [
{ _id: 123, associated: 12 },
{ _id: 122, associated: 8 },
{ _id: 121, associated: 12 },
{ _id: 120, associated: 12 },
{ _id: 119, associated: 17 }
]
I'd like to return a list like this:
results = [
{ _id: 124, associated: 3 },
{ _id: 125, associated: 19 },
[
{ _id: 123, associated: 12 },
{ _id: 121, associated: 12 },
{ _id: 120, associated: 12 },
],
{ _id: 122, associated: 8 },
{ _id: 119, associated: 17 }
]
Please don't worry too much about the data format: it's just a sketch to try to show the sort of thing I want. I want a result set of a specified size, but with some results grouped according to some criterion.
Obviously one way to do this would be to just make the query, crawl and modify the results, then recursively make the query again until the result set is as long as desired. That way seems awkward. Is there a better way to go about this? I'm having trouble phrasing it in a Google search in a way that gets me anywhere near anyone who might have insight.
Here's an aggregation pipeline query that will do what you are asking for:
db.comments.aggregate([
{ $group: { _id: "$associated", maxID: { $max: "$_id"}, cohorts: { $push: "$$ROOT"}}},
{ $sort: { "maxID": -1 } },
{ $limit: 5 }
])
Lacking any other fields from the sample data to sort by, I used $_id.
If you'd like results that are a little closer in structure to the sample result set you provided you could add a $project to the end:
db.comments.aggregate([
{ $group: { _id: "$associated", maxID: { $max: "$_id"}, cohorts: { $push: "$$ROOT"}}},
{ $sort: { "maxID": -1 } },
{ $limit: 5 },
{ $project: { _id: 0, cohorts: 1 }}
])
That will print only the result set. Note that even comments that do not share an association will still be in an array; it will just be an array of length 1.
If you are concerned about limiting the results in the grouping as Neil Lunn is suggesting, perhaps a $match in the beginning is a smart idea.
db.comments.aggregate([
{ $match: { createDate: { $gte: new Date(new Date() - 5 * 60000) } } },
{ $group: { _id: "$associated", maxID: { $max: "$_id"}, cohorts: { $push: "$$ROOT"}}},
{ $sort: { "maxID": -1 } },
{ $limit: 5 },
{ $project: { _id: 0, cohorts: 1 }}
])
That will only include comments made in the last 5 minutes assuming you have a createDate type field. If you do, you might also consider using that as the field to sort by instead of "_id". If you do not have a createDate type field, I'm not sure how best to limit the comments that are grouped as I do not know of a "current _id" in the way that there is a "current time".
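Since you mentioned Mongoose, the same pipeline can be run through the model's aggregate() helper. A minimal sketch, assuming your model is registered as Comment and the field names match your samples:
// hypothetical model name; adjust field names to your schema
const results = await Comment.aggregate([
    { $group: { _id: "$associated", maxID: { $max: "$_id" }, cohorts: { $push: "$$ROOT" } } },
    { $sort: { maxID: -1 } },
    { $limit: 5 },
    { $project: { _id: 0, cohorts: 1 } }
]);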
I honestly think you are asking a lot here and cannot really see the utility myself, but I'm always happy to have that explained to me if there is something useful I have missed.
Bottom line is you want comments from the last five distinct users by date, and then some sort of grouping of additional comments by those users. The last part is where I see difficulty in defining the rules, no matter how you want to attack this, but I'll try to keep this in its briefest form.
No way this happens in a single query of any sort. But there are things that can be done to make it an efficient server response:
var async = require('async'),
    DataStore = require('nedb'),
    store = new DataStore();
async.waterfall([
function(callback) {
Comment.aggregate(
[
{ "$match": { "postId": thisPostId } },
{ "$sort": { "associated": 1, "createdDate": -1 } },
{ "$group": {
"_id": "$associated",
"date": { "$first": "$createdDate" }
}},
{ "$sort": { "date": -1 } },
{ "$limit": 5 }
],
callback);
},
function(docs,callback) {
async.each(docs,function(doc,callback) {
Comment.aggregate(
[
{ "$match": { "postId": thisPostId, "associated": doc._id } },
{ "$sort": { "createdDate": -1 } },
{ "$limit": 5 },
{ "$group": {
"_id": "$associated",
"docs": {
"$push": {
"_id": "$_id", "createdDate": "$createdDate"
}
},
"firstDate": { "$first": "$createdDate" }
}}
],
function(err,results) {
if (err) callback(err);
async.each(results,function(result,callback) {
store.insert( result, function(err, result) {
callback(err);
});
},function(err) {
callback(err);
});
}
);
},
callback);
}],
function(err) {
if (err) throw err;
store.find({}).sort({ "firstDate": - 1 }).exec(function(err,docs) {
if (err) throw err;
console.log( JSON.stringify( docs, undefined, 4 ) );
});
}
);
Now I stuck more document properties in both the document and the array, but the simplified form based on your sample would then come out like this:
results = [
{ "_id": 3, "docs": [124] },
{ "_id": 19, "docs": [125] },
{ "_id": 12, "docs": [123,121,120] },
{ "_id": 8, "docs": [122] },
{ "_id": 17, "docs": [119] }
]
So the essential idea is to first find your distinct "users" who were the last to comment, basically chopping that off at the last 5. Without filtering on some kind of range here, that would go over the entire collection to get those results, so it would be best to restrict this in some way, such as the last hour or last few hours or whatever is sensible as required. Just add those conditions to the $match along with the current post that is associated with the comments.
Once you have those 5, then you want to get any possible "grouped" details for multiple comments by those users. Again, some sort of limit is generally advised for a timeframe, but as a general case this is just looking for the most recent comments by the user on the current post and restricting that to 5.
The execution here is done in parallel, which will use more resources but is fairly effective considering there are only 5 queries to run anyway. In contrast to your example output, the array here is inside the document result, and it contains the original document id values for each comment for reference. Any other content related to the document would be pushed into the array as well as required (ie The content of the comment).
The other little trick here is using nedb as a means for storing the output of each query in an "in memory" collection. This need only really be a standard hash data structure, but nedb gives you a way of doing that while maintaining the MongoDB statement form that you may be used to.
Once all results are obtained you just return them as your output, and sorted as shown to retain the order of who commented last. The actual comments are grouped in the array for each item and you can traverse this to output how you like.
Bottom line here is that you are asking for a compounded version of the "top N results problem", which is something often asked of MongoDB. I've written about ways to tackle this before to show how it's possible in a single aggregation pipeline stage, but it really is not practical for anything more than a relatively small result set.
If you really want to join in the insanity, then you can look at Mongodb aggregation $group, restrict length of array for one of the more detailed examples. But for my money, I would run on parallel queries any day. Node.js has the right sort of environment to support them, so you would be crazy to do it otherwise.
I want to fetch user_totaldocs and user_totalthings for all users and sum those fields.
How can I do this? Here is the user schema:
var user_schema = mongoose.Schema({
local : {
...
...
user_id : String,
user_totaldocs : Number,
user_totalthings : Number
....
}
});
You can use the Aggregation Pipeline to add calculated fields to a result. There are some examples below using the mongo shell, but the syntax in Mongoose's Aggregate() helper is similar.
For example, to calculate sums (per user document) you can use the $add expression in a $project stage:
db.user.aggregate(
// Limit to relevant documents and potentially take advantage of an index
{ $match: {
user_id: "foo"
}},
{ $project: {
user_id: 1,
total: { $add: ["$user_totaldocs", "$user_totalthings"] }
}}
)
To calculate totals across multiple documents you need to use a $group stage with a $sum accumulator, for example:
db.user.aggregate(
{ $group: {
_id: null,
total: { $sum: { $add: ["$user_totaldocs", "$user_totalthings"] } },
totaldocs: { $sum: "$user_totaldocs" },
totalthings: { $sum: "$user_totalthings" }
}}
)
You may want only the one total field; I've added in totaldocs and totalthings as examples of calculating multiple fields.
A group _id of null will combine values from all documents passed to the $group stage, but you can also use other criteria here (such as grouping by user_id).
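For reference, a rough Mongoose version of the per-document sum, assuming your model is registered as User and the fields live under local as in the schema above:
// hypothetical model name; field paths follow the schema in the question
const results = await User.aggregate([
    { $match: { "local.user_id": "foo" } },
    { $project: {
        user_id: "$local.user_id",
        total: { $add: ["$local.user_totaldocs", "$local.user_totalthings"] }
    } }
]);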
You can use the aggregation framework provided by MongoDB. For your case --
if you want to fetch the sum of user_totaldocs and the sum of user_totalthings across the collection (meaning for all users), do --
db.user_schemas.aggregate(
[
{
$group : {
_id : null,
user_totaldocs: { $sum: "$user_totaldocs"}, // for your case use local.user_totaldocs
user_totalthings: { $sum: "$user_totalthings" }, // for your case use local.user_totalthings
count: { $sum: 1 } // for no. of documents count
}
}
])
To sum user_totaldocs and user_totalthings for a particular user in the collection (assuming there are multiple documents per user), this will return a sum for each user; do --
db.user_schemas.aggregate(
[
{
$group : {
user_id : "$user_id",
user_totaldocs: { $sum: "$user_totaldocs"}, // for your case use local.user_totaldocs
user_totalthings: { $sum: "$user_totalthings" }, // for your case use local.user_totalthings
count: { $sum: 1 } // for no. of documents count
}
}
])
No need to provide individual user id.
For more info read:
1. http://docs.mongodb.org/manual/reference/operator/aggregation/group/#pipe._S_group
2. http://docs.mongodb.org/manual/core/aggregation/
I am trying to aggregate some records in a Mongo database using the Node driver. I am first matching on the org, frd, and sl fields (these are indexed). If I only include a few companies in the array that I am matching the org field against, the query runs fine and works as expected. However, when including all of the clients in the array, I always get:
MongoError: getMore: cursor didn't exist on server, possible restart or timeout?
I have tried playing with the allowDiskUse and batchSize settings, but nothing seems to work. With all the client strings in the array, the aggregation runs for ~5 hours before throwing the cursor error. Any ideas? Below is the pipeline along with the actual aggregate command.
Setting up the aggregation pipeline:
var aggQuery = [
{
$match: { //all clients, from last three days, and scored
org:
{ $in : array } //this is the array I am talking about
,
frd: {
$gte: _.last(util.lastXDates(3))
},
sl : true
}
}
, {
$group: { //group by isp and make fields for calculation
_id: "$gog",
count: {
$sum: 1
},
countRisky: {
$sum: {
$cond: {
if :{
$gte: ["$scr", 65]
},
then: 1,
else :0
}
}
},
countTimeZoneRisky: {
$sum: {
$cond: {
if :{
$eq: ["$gmt", "$gtz"]
},
then: 0,
else :1
}
}
}
}
}
, {
$match: { //show records with count >= 500
count: {
$gte: 500
}
}
}
, {
$project: { //rename _id to isp, only show relevent fields
_id: 0,
ISP: "$_id",
percentRisky: {
$multiply: [{
$divide: ["$countRisky", "$count"]
},
100
]
},
percentTimeZoneDiscrancy: {
$multiply: [{
$divide: ["$countTimeZoneRisky", "$count"]
},
100
]
},
count: 1
}
}
, {
$sort: { //sort by percent risky and then by count
percentRisky: 1,
count: 1
}
}
];
Running the aggregation:
var cursor = reportingCollections.hitColl.aggregate(aggQuery, {
allowDiskUse: true,
cursor: {
batchSize: 40000
}
});
console.log('Writing data to csv ' + currentFileNamePrefix + '!');
//iterate through cursor and write documents to CSV
cursor.each(function (err, document) {
//write each document to csv file
//maybe start a nuclear war
});
You're calling the aggregate method, which doesn't return a cursor by default (unlike e.g. find()). To get the query back as a cursor, you must add the cursor option to the options. However, a timeout setting for the aggregation cursor is (currently) not supported; the native Node.js driver only supports the batchSize setting.
You would set the batch size option like this:
var cursor = coll.aggregate(query, {cursor: {batchSize:100}}, writeResultsToCsv);
To circumvent such problems, I'd recommend running the aggregation or a map-reduce directly through the mongo client. There you can add the notimeout option.
The default timeout is 10 minutes (obviously useless for long, time-consuming queries) and, as far as I know, there's currently no way to set a different one, only to make it infinite via the aforementioned option. The timeout hits you especially with high batch sizes, because processing the incoming docs can take more than 10 minutes, and by the time you ask the mongo server for more, the cursor has already been deleted.
I don't know your use case, but if it's a web view, it should only involve fast queries/aggregations.
BTW I think this didn't change with 3.0.*
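If you do end up running it directly in the mongo shell as suggested, a minimal sketch might look like this (assuming the collection is called hits and aggQuery is the pipeline array defined above; adjust the name and batch size to your setup):
var shellCursor = db.hits.aggregate(aggQuery, { allowDiskUse: true, cursor: { batchSize: 1000 } });
while (shellCursor.hasNext()) {
    printjson(shellCursor.next()); // or format/write to CSV here
}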