How can I optimize this query in mongo db? - node.js

Here is the query:
const tags = await mongo
.collection("positive")
.aggregate<{ word: string; count: number }>([
{
$lookup: {
from: "search_history",
localField: "search_id",
foreignField: "search_id",
as: "history",
pipeline: [
{
$match: {
created_at: { $gt: prevSunday.toISOString() },
},
},
{
$group: {
_id: "$url",
},
},
],
},
},
{
$match: {
history: { $ne: [] },
},
},
{
$group: {
_id: "$word",
url: {
$addToSet: "$history._id",
},
},
},
{
$project: {
_id: 0,
word: "$_id",
count: {
$size: {
$reduce: {
input: "$url",
initialValue: [],
in: {
$concatArrays: ["$$value", "$$this"],
},
},
},
},
},
},
{
$sort: {
count: -1,
},
},
{
$limit: 50,
},
])
.toArray();
I think I need an index but not sure how or where to add.

Performance of this operation should perhaps be revisited only after we confirm that it satisfies the desired application logic and that the approach itself is reasonable.
When it comes to performance, there is nothing that can be done to improve efficiency on the positive collection if the intention is to process every document. By definition, processing all documents requires a full collection scan.
To efficiently support the $lookup on the search_history collection, you may wish to confirm that an index on { search_id: 1, created_at: 1, url: 1 } exists. Providing the .explain("allPlansExecution") output would allow us to better understand the current performance characteristics.
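A minimal sketch of creating that index from Node.js, assuming `db` is an already connected `Db` handle from the official driver (the helper name `ensureSearchHistoryIndex` is made up for illustration):

```javascript
// Hypothetical helper: `db` is assumed to be a connected MongoDB `Db` handle.
async function ensureSearchHistoryIndex(db) {
  // Equality field first (search_id), then the range field (created_at),
  // then url so the $group inside the $lookup can read it from the index.
  return db
    .collection("search_history")
    .createIndex({ search_id: 1, created_at: 1, url: 1 });
}
```

Creating an index that already exists with the same keys is a no-op, so calling this at application start-up should be harmless.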
Desired Logic
Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful with respect to understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:
For every single document in the positive collection, add a new field called history.
This new field is a list of url values from the search_history collection where the corresponding document has a matching search_id value and was created_at after last Sunday.
The aggregation then filters to only keep documents where the new history field has at least one entry.
The next stage then groups the results together by word. The $addToSet operator is used here, but it may be generating an array of arrays rather than de-duplicated urls.
The final 3 stages of the aggregation seem to be focused on calculating the number of urls and returning the top 50 results by word sorted on that size in descending order.
Is this what you want? In particular the following aspects may be worth confirming:
Is it your intention to process every document in the positive collection? This may be the case, but it's impossible to tell without any schema/use-case context.
Is the size calculation of the urls correct? It seems like you may need to use a $map when doing the $addToSet for the $group instead of using $reduce for the subsequent $project.
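To make the concern about the size calculation concrete, here is a plain JavaScript stand-in for what the $group and $project stages compute. It does not talk to MongoDB; `countUrlsPerWord` and the sample rows are invented for illustration. It compares the current $concatArrays flattening with a $setUnion flattening, which is one possible way to de-duplicate urls across documents, assuming that is the intent:

```javascript
// `rows` mimics documents after the $lookup and $match stages: one word plus
// the distinct urls found for that document's search_id.
function countUrlsPerWord(rows) {
  const grouped = new Map();
  for (const { word, urls } of rows) {
    const arrays = grouped.get(word) ?? new Map();
    // $addToSet keeps each distinct *array* once, not each distinct url.
    arrays.set(JSON.stringify(urls), urls);
    grouped.set(word, arrays);
  }
  return [...grouped].map(([word, arrays]) => {
    const flat = [...arrays.values()].flat();
    return {
      word,
      concatCount: flat.length, // what $reduce with $concatArrays yields
      unionCount: new Set(flat).size, // what $reduce with $setUnion would yield
    };
  });
}

const sampleRows = [
  { word: "mongo", urls: ["a.com", "b.com"] },
  { word: "mongo", urls: ["b.com", "c.com"] },
];
const urlCounts = countUrlsPerWord(sampleRows);
// urlCounts[0] is { word: "mongo", concatCount: 4, unionCount: 3 }
```

If the distinct count is what you want, replacing $concatArrays with $setUnion inside the $reduce should be enough; if counting the same url once per document is intended, the current pipeline already does that.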

The best thing to do is to limit the number of documents passed to each stage.
In an aggregation, MongoDB can generally use an index only for the stages at the very start of the pipeline, typically a leading $match, and it will usually pick a single index for it.
So the best thing to do is to have a match on an indexed field that is very restrictive.
Moreover, please note that $limit, $skip and $sample are not panaceas, because the server may still have to scan all the documents produced by the stages that precede them.
A way to efficiently limit the number of documents selected on the first stage is to use a "pagination". You can make it work like this :
Once every X requests:
1) Count the number of docs in the collection
2) Divide this in chunks of Yk max
3) Find the _ids of the docs at positions Y, 2Y, 3Y etc. with skip and limit
4) Cache the results in redis/memcache (or as a global variable if you really cannot do otherwise)
Every request:
1) Get the current chunk to scan by reading the redis keys used and nbChunks
2) Get the two _ids cached in redis that delimit the next aggregation: id:${used%nbChunks} and id:${(used%nbChunks)+1} respectively
3) Aggregate using $match with { _id: { $gte: ObjectId(id0), $lt: ObjectId(id1) } }
4) Increment used; if used > X then update the chunks
Further optimisation
If using redis, prefix every key with ${cluster.worker.id}: to avoid hot keys.
Notes
The step 3) of the setup of chunks can be a really long and intensive process, so do it only when necessary, let's say every X~1k requests.
If you are scanning the last chunk, do not put the $lt
Once this process is implemented, your job is to find the sweet spot of X and Y that suits your needs: Y must be large enough to retrieve as many documents as possible without making each request too slow, and X must keep the chunks roughly equal as the collection gets more and more documents.
This process is a bit long to implement but once it is, time complexity is ~O(Y) and not ~O(N). Indeed, the $match being the first stage and _id being a field that is indexed, this first stage is really fast and limits to max Y documents scanned.
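The per-request part of the steps above can be sketched as a small pure function. Everything here is an assumption for illustration: `buildChunkMatch` is a made-up name, `boundaryIds` stands for the cached _ids (as hex strings) that start each chunk, and `toObjectId` is a hook where you would pass something like `(hex) => new ObjectId(hex)` in real code:

```javascript
// Builds the leading $match stage for the chunk selected by the request counter.
function buildChunkMatch(boundaryIds, used, toObjectId = (hex) => hex) {
  const nbChunks = boundaryIds.length;
  const current = used % nbChunks;
  const range = { $gte: toObjectId(boundaryIds[current]) };
  // The last chunk gets no upper bound, so documents inserted after the
  // boundaries were computed are still scanned.
  if (current + 1 < nbChunks) {
    range.$lt = toObjectId(boundaryIds[current + 1]);
  }
  return { $match: { _id: range } };
}

const chunkBoundaries = ["id-a", "id-b", "id-c"];
const middleChunk = buildChunkMatch(chunkBoundaries, 4); // 4 % 3 = chunk 1
const lastChunk = buildChunkMatch(chunkBoundaries, 5); // 5 % 3 = chunk 2 (last)
```

The returned stage goes first in the pipeline, ahead of the $lookup, so that the _id index limits how many documents the later stages see.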
Hope it helps =) Make sure to ask more if needed =)

Related

MongoDB aggregation lookup with pagination is working slow in huge amount of data

I have a collection with more than 150,000 documents in MongoDB. I'm using Mongoose ODM v5.4.2 for MongoDB in Node.js. When retrieving data I'm using an aggregation $lookup with $skip and $limit for pagination. My code works fine, but past 100k documents it takes 10-15 seconds to retrieve data, even though I'm showing only 100 records at a time with the help of $skip and $limit. I've already created an index for the foreignField, but it is still slow.
campaignTransactionsModel.aggregate([{
$match: {
campaignId: new importModule.objectId(campaignData._id)
}
},
{
$lookup: {
from: userDB,
localField: "userId",
foreignField: "_id",
as: "user"
},
},
{
$lookup: {
from: 'campaignterminalmodels',
localField: "terminalId",
foreignField: "_id",
as: "terminal"
},
},
{
'$facet': {
edges: [{
$sort: {
[sortBy]: order
}
},
{ $skip: skipValue },
{ $limit: viewBy },
]
}
}
]).allowDiskUse(true).exec(function(err, docs) {
console.log(docs);
});
The query takes longer because the server has to scan from the beginning of the input results (everything before the $skip stage) in order to skip the given number of docs and build the new result set.
From official MongoDB docs :
The cursor.skip() method requires the server to scan from the
beginning of the input results set before beginning to return results.
As the offset increases, cursor.skip() will become slower.
You can use range queries to simulate similar result as of .skip() or skip stage(aggregation)
Using Range Queries
Range queries can use indexes to avoid scanning unwanted documents,
typically yielding better performance as the offset grows compared to
using cursor.skip() for pagination.
Descending Order
Use this procedure to implement pagination with range queries:
Choose a field such as _id which generally changes in a consistent
direction over time and has a unique index to prevent duplicate
values
Query for documents whose field is less than the start value
using the $lt and cursor.sort() operators, and
Store the last-seen field value for the next query.
Increasing Order
Query for documents whose field is greater than the start value
using the $gt and cursor.sort() operators.
Let's say the last doc you got has _id: objectid1. You can then query the docs that have _id: {$lt: objectid1} to get the docs in decreasing order, and for increasing order you can query the docs that have _id: {$gt: objectid1}.
Read official docs on Range queries for more information.
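As a rough sketch of the range-query idea applied to an aggregation, the function below builds the stages that would replace $skip/$limit. The name `buildPageStages` and its parameters are invented for illustration, and it assumes you paginate on _id; with a different sort field the range condition has to be on that field instead:

```javascript
// Builds the pagination stages for one page, given the last _id already seen.
// Pass lastSeenId = null for the first page.
function buildPageStages(lastSeenId, pageSize, descending = true) {
  const stages = [];
  if (lastSeenId !== null) {
    // Range condition instead of $skip: only look past the last-seen value.
    const range = descending ? { $lt: lastSeenId } : { $gt: lastSeenId };
    stages.push({ $match: { _id: range } });
  }
  stages.push({ $sort: { _id: descending ? -1 : 1 } }, { $limit: pageSize });
  return stages;
}

const firstPage = buildPageStages(null, 100);
const nextPage = buildPageStages("last-seen-id", 100);
```

These stages work best placed before the $lookup stages, so the joins only run for the documents of the page being shown.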

Upsert and $inc Sub-document in Array

The following schema is intended to record total views and views for a very specific day only.
const usersSchema = new Schema({
totalProductsViews: {type: Number, default: 0},
productsViewsStatistics: [{
day: {type: String, default: new Date().toISOString().slice(0, 10), unique: true},
count: {type: Number, default: 0}
}],
});
So today views will be stored in another subdocument different from yesterday. To implement this I tried to use upsert so as subdocument will be created each day when product is viewed and counts will be incremented and recorded based on a particular day. I tried to use the following function but seems not to work the way I intended.
usersSchema.statics.increaseProductsViews = async function (id) {
//Based on day only.
const todayDate = new Date().toISOString().slice(0, 10);
const result = await this.findByIdAndUpdate(id, {
$inc: {
totalProductsViews: 1,
'productsViewsStatistics.$[sub].count': 1
},
},
{
upsert: true,
arrayFilters: [{'sub.day': todayDate}],
new: true
});
console.log(result);
return result;
};
What do I miss to get the functionality I want? Any help will be appreciated.
What you are trying to do here actually requires you to understand some concepts you may not have grasped yet. The two primary ones being:
You cannot use any positional update as part of an upsert since it requires data to be present
Adding items into arrays mixed with "upsert" is generally a problem that you cannot do in a single statement.
It's a little unclear if "upsert" is your actual intention anyway or if you just presumed that was what you had to add in order to get your statement to work. It does complicate things if that is your intent, even if it's unlikely given the findByIdAndUpdate() usage, which would imply you were actually expecting the "document" to be always present.
At any rate, it's clear you actually expect to "Update the array element when found, OR insert a new array element where not found". This is actually a two write process, and three when you consider the "upsert" case as well.
For this, you actually need to invoke the statements via bulkWrite():
usersSchema.statics.increaseProductsViews = async function (_id) {
  // Based on day only.
  const todayDate = new Date().toISOString().slice(0, 10);
  await this.bulkWrite([
    // Try to match an existing element and update it ( do NOT upsert )
    {
      "updateOne": {
        "filter": { _id, "productsViewsStatistics.day": todayDate },
        "update": {
          "$inc": {
            "totalProductsViews": 1,
            "productsViewsStatistics.$.count": 1
          }
        }
      }
    },
    // Try to $push where the element is not there but document is - ( do NOT upsert )
    {
      "updateOne": {
        "filter": { _id, "productsViewsStatistics.day": { "$ne": todayDate } },
        "update": {
          "$inc": { "totalProductsViews": 1 },
          "$push": { "productsViewsStatistics": { "day": todayDate, "count": 1 } }
        }
      }
    },
    // Finally attempt upsert where the "document" was not there at all,
    // only if you actually mean it - so optional
    {
      "updateOne": {
        "filter": { _id },
        "update": {
          "$setOnInsert": {
            "totalProductsViews": 1,
            "productsViewsStatistics": [{ "day": todayDate, "count": 1 }]
          }
        },
        "upsert": true
      }
    }
  ]);
  // return the modified document if you really must
  return this.findById(_id); // Not atomic, but the lesser of all evils
};
So there's a real good reason why the positional filtered $[<identifier>] operator does not apply here. The main reason is that its intended purpose is to update multiple matching array elements, and you only ever want to update one. This actually has a specific operator in the positional $ operator, which does exactly that. Its condition, however, must be included within the query predicate ( "filter" property in updateOne statements ) just as demonstrated in the first two statements of the bulkWrite() above.
So the main problem with using the positional filtered $[<identifier>] is that, just as the first two statements show, you cannot actually alternate between the $inc or $push depending on whether the document actually contains an array entry for the day. All that will happen is at best no update will be applied when the current day is not matched by the expression in arrayFilters.
At worst, an actual "upsert" will throw an error due to MongoDB not being able to decipher the "path name" from the statement, and of course you simply cannot $inc something that does not exist as a "new" array element. That needs a $push.
That leaves you with the mechanic that you also cannot do both the $inc and $push within a single statement. MongoDB will error that you are attempting to "modify the same path" as an illegal operation. Much the same applies to $setOnInsert since whilst that operator only applies to "upsert" operations, it does not preclude the other operations from happening.
Thus the logical steps fall back to what the comments in the code also describe:
Attempt to match where the document contains an existing array element, then update that element. Using $inc in this case
Attempt to match where the document exists but the array element is not present and then $push a new element for the given day with the default count, updating other elements appropriately
IF you actually did intend to upsert documents ( not array elements, because that's the above steps ) then finally actually attempt an upsert creating new properties including a new array.
Finally there is the issue of the bulkWrite(). Whilst this is a single request to the server with a single response, it still is effectively three ( or two if that's all you need ) operations. There is no way around that and it is better than issuing chained separate requests using findByIdAndUpdate() or even updateOne().
Of course the main operational difference from the perspective of code you attempted to implement is that method does not return the modified document. There is no way to get a "document response" from any "Bulk" operation at all.
As such the actual "bulk" process will only ever modify a document with one of the three statements submitted based on the presented logic and most importantly the order of those statements, which is important. But if you actually wanted to "return the document" after modification then the only way to do that is with a separate request to fetch the document.
The only caveat here is that there is the small possibility that other modifications could have occurred to the document other than the "array upsert" since the read and update are separated. There really is no way around that, without possibly "chaining" three separate requests to the server and then deciding which "response document" actually applied the update you wanted to achieve.
So with that context it's generally considered the lesser of evils to do the read separately. It's not ideal, but it's the best option available from a bad bunch.
As a final note, I would strongly suggest actually storing the day property as a BSON Date instead of as a string. It actually takes fewer bytes to store and is far more useful in that form. As such the following constructor is probably the clearest and least hacky:
const todayDate = new Date(new Date().setUTCHours(0,0,0,0))
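If you want to reuse that in several places, the same expression can be wrapped in a small helper. The name `utcMidnight` is made up here; it only uses the standard Date API:

```javascript
// Returns a new Date set to 00:00:00.000 UTC of the given date's UTC day.
function utcMidnight(date = new Date()) {
  const copy = new Date(date.getTime()); // do not mutate the caller's Date
  copy.setUTCHours(0, 0, 0, 0);
  return copy;
}

const dayKey = utcMidnight(new Date("2024-03-15T17:42:10.500Z"));
// dayKey.toISOString() is "2024-03-15T00:00:00.000Z"
```

Every view on the same UTC day then produces an identical Date value, so it can be used in the "filter" of the bulkWrite() statements exactly as the string was.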

Near Geometry with a Join

I'm trying to fetch results from my MongoDB server. The query: get the cars that are at the nearest agency.
This is what I have tried, but I'm getting results without sorting:
let cars = await Cars.find({disponible: true})
.populate({
path: 'agency',
match: {
"location": {
$near: {
$geometry: {
coordinates: [ latitude , longitude ]
},
}
}
},
select: 'name'
})
.select('name agency');
// send result via api
res.status(200).json({cars})
my schemas
//Car Schema
const carSchema = new Schema({
name: { type: String, required: true},
agency: {type: Schema.Types.ObjectId, ref: 'agencies'},
}, { timestamps: true });
//Agency Schema
const agencySchema = new Schema({
name: { type: String, required: true},
location: {
type: {
type: String,
enum: ['Point'],
default: 'Point'
},
coordinates: {
type: [Number],
required: true
}
},
}, { timestamps: true });
I want to get cars with their agency, sorted by the nearest agency.
There's a reason populate() cannot work
Using populate() you won't be able to do this, and for a number of reasons. The main reason being that all populate() is doing is essentially marrying up your foreign reference to results from another collection with given query parameters.
In fact with a $near query, the results could be quite weird, since you might not receive enough "near" results to actually marry up with all the parent references.
There's a bit more detail about the "foreign constraint" limitation with populate() in existing answers to Querying after populate in Mongoose and of course on the modern solution to this, which is $lookup.
Using $lookup and $geoNear
In fact, what you need is a $lookup along with a $geoNear, but you also must do the "join" the other way around to what you might expect. And thus from the Agency model you would do:
Agency.aggregate([
// First find "near" agencies, and project a distance field
{ "$geoNear": {
"near": {
"type": "Point",
"coordinates": [ longitude , latitude ]
},
"distanceField": "distance",
"spherical": true
}},
// Then marry these up to Cars - which can be many
{ "$lookup": {
"from": Car.collection.name,
"let": { "agencyId": "$_id" },
"pipeline": [
{ "$match": {
"disponible": true,
"$expr": { "$eq": [ "$$agencyId", "$agency" ] }
}}
],
"as": "cars"
}},
// Unwinding denormalizes that "many"
{ "$unwind": "$cars" },
// Group is "inverting" the result
{ "$group": {
"_id": "$cars._id",
"car": { "$first": "$cars" },
"agency": {
"$first": {
"$arrayToObject": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"cond": { "$ne": [ "$$this.k", "cars" ] }
}
}
}
}
}},
// Sort by distance, nearest is least
{ "$sort": { "agency.distance": 1 } },
// Reformat to expected output
{ "$replaceRoot": {
"newRoot": {
"$mergeObjects": [ "$car", { "agency": "$agency" } ]
}
}}
])
As stated the $geoNear part must come first. Bottom line is it basically needs to be the very first stage in an aggregation pipeline in order to use the mandatory index for such a query. Though it is true that given the form of $lookup shown here you "could" actually use a $near expression within the $lookup pipeline with a starting $match stage, it won't return what you expect since basically the constraint is already on the matching _id value. And it's really just the same problem populate() has in that regard.
And of course though $geoNear has a "query" constraint, you cannot use $expr within that option so this rules out that stage being used inside the $lookup pipeline again. And yes, still basically the same problem of conflicting constraints.
So this means you $geoNear from your Agency model instead. This pipeline stage has the additional thing it does which is it actually projects a "distanceField" into the result documents. So a new field within the documents ( called "distance" in the example ) will then indicate how far away from the queried point the matched document is. This is important for sorting later.
Of course you want this "joined" to the Car, so you want to do a $lookup. Note that since MongoDB has no knowledge of mongoose models the $lookup pipeline stage expects the "from" to be the actual collection name on the server. Mongoose models typically abstract this detail away from you ( though it's normally the plural of the model name, in lowercase ), but you can always access this from the .collection.name property on the model as shown.
The other arguments are the "let" in which you keep a reference to the _id of the current Agency document. This is used within the $expr of the $match in order to compare the local and foreign keys for the actual "joining" condition. The other constraints in the $match further filter down the matching "cars" to those criteria as well.
Now it's probably likely there are in fact many cars to each agency and that is one basic reason the model has been done like this in separate collections. Regardless of whether it's one to one or one to many, the $lookup result always produces an array. Basically we now want this array to "denormalize" and essentially "copy" the Agency detail for each found Car. This is where $unwind comes in. An added benefit is that when you $unwind the array of matching "cars", any empty array where the constraints did not match anything effectively removes the Agency from the possible results altogether.
Of course this is the wrong way around from how you actually want the results, as it's really just "one car" with "one agency". This is where $group comes in and collects information "per car". Since this way around it is expected as "one to one", the $first operator is used as an accumulator.
There is a fancy expression in there with $objectToArray and $arrayToObject, but really all that is doing is removing the "cars" field from the "agency" content, just as the "$first": "$cars" is keeping that data separate.
Back to something closer to the desired output, the other main thing is to $sort the results so the "nearest" results are the ones listed first, just as the initial goal was all along. This is where you actually use the "distance" value which was added to the document in the original $geoNear stage.
At this point you are nearly there, and all that is needed is to reform the document into the expected output shape. The final $replaceRoot does this by taking the "car" value from the earlier $group output and promoting it to the top level object to return, and "merging" in the "agency" field to appear as part of the Car itself. Clearly $mergeObjects does the actual "merging".
That's it. It does work, but you may have spotted the problem that you don't actually get to say "near to this AND with this other constraint" technically as part of a single query. And a funny thing about "nearest" results is they do have an in-built "limit" on results they should return.
And that is basically in the next topic to discuss.
Changing the Model
Whilst all the above is fine, it's still not really perfect and has a few problems. The most notable problem should be that it's quite complex and that "joins" in general are not good for performance.
The other considerable flaw is that as you might have gathered from the "query" parameter on the $geoNear stage, you are not really getting the equivalent of both conditions ( find nearest agency to AND car has disponible: true ) since on separate collections the initial "near" does not consider the other constraint.
Nor can this even be done from the original order just as was intended, and again comes back to the problem with populate() here.
So the real issue unfortunately is design. And it may be a difficult pill to swallow, but the current design which is extremely "relational" in nature is simply not a good fit for MongoDB in how it would handle this type of operation.
The core problem is the "join", and in order to make things work we basically need to get rid of it. And you do that in MongoDB design by embedding the document instead of keeping a reference in another collection:
const carSchema = new Schema({
name: { type: String, required: true},
agency: {
name: { type: String, required: true},
location: {
type: {
type: String,
enum: ['Point'],
default: 'Point'
},
coordinates: {
type: [Number],
required: true
}
}
}
}, { timestamps: true });
In short "MongoDB is NOT a relational database", and it also does not really "do joins" as the sort of integral constraint over a join you are looking for simply is not supported.
Well, it's not supported by $lookup and the ways it will do things, but the official line has been and will always be that a "real join" in MongoDB is embedded detail. Which simply means "if it's meant to be a constraint on queries you want to do, then it belongs in the same document".
With that redesign the query simply becomes:
Car.find({
  disponible: true,
  "agency.location": {
    $near: {
      $geometry: {
        type: "Point",
        // GeoJSON order is longitude first, then latitude
        coordinates: [ longitude , latitude ]
      }
    }
  }
})
YES, that would mean that you likely duplicate a lot of information about an "agency" since the same data would likely be present on many cars. But the facts are that for this type of query usage, this is actually what MongoDB is expecting you to model as.
Conclusion
So the real choices here come down to which case suits your needs:
Accept that you are possibly returning fewer than the expected results due to "double filtering" through the use of a $geoNear and $lookup combination. Note that in MongoDB versions before 4.2, $geoNear will only return 100 results by default, unless you change that. This can be an unreliable combination for "paged" results.
Restructure your data accepting the "duplication" of agency detail in order to get a proper "dual constraint" query since both criteria are in the same collection. It's more storage and maintenance, but it is more performant and completely reliable for "paged" results.
And of course if it's neither acceptable to use the aggregation approach shown or the restructure of data, then this can only show that MongoDB is probably not best suited to this type of problem, and you would be better off using an RDBMS where you decide you must keep normalized data as well as be able to query with both constraints in the same operation. Provided of course you can choose an RDBMS which actually supports the usage of such GeoSpatial queries along with "joins".

How to calculate Rating in my MongoDB design

I'm creating a system that users can write review about an item and rate it from 0-5. I'm using MongoDB for this. And my problem is to find the best solution to calculate the total rating in product schema. I don't think querying all comments to get the size and dividing it by total rating is a good solution. Here is my Schema. I appreciate any advice:
Comments:
var commentSchema = new Schema({
Rating : { type: Number, default:0 },
Helpful : { type: Number, default:0 },
User :{
type: Schema.ObjectId,
ref: 'users'
},
Content: String,
});
Here is my Item schema:
var productSchema = new Schema({
//id is barcode
_id : String,
Rating : { type: Number, default:0 },
Comments :[
{
type: Schema.ObjectId,
ref: 'comments'
}
],
});
EDIT: HERE is the solution I got from another topic : calculating average in Mongoose
You can get the total using the aggregation framework. First you use the $unwind operator to turn the comments into a document stream:
{ $unwind: "$Comments" }
The result is that each product-document is turned into one product-document per entry in its Comments array. That comment-entry becomes a single object under the field Comments; all other fields are taken from the originating product-document.
Then you use $group to rejoin the documents for each product by their _id, while you use the $avg operator to calculate the average of the rating-field:
{ $group: {
_id: "$_id",
average: { $avg: "$Comments.Rating" }
} }
Putting those two steps into an aggregation pipeline calculates the average rating for every product in your collection. You might want to narrow it down to one or a small subset of products, depending on what the user requested right now. To do this, prepend the pipeline with a $match step. The $match object works just like the one you pass to find().
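Put together, the pipeline could be assembled like this. It is only a sketch: `buildAverageRatingPipeline` is a made-up helper, and like the stages above it assumes the comments (with their Rating) are embedded in the product document rather than stored as ObjectId references:

```javascript
// Assembles $match (optional), $unwind and $group into one pipeline array.
function buildAverageRatingPipeline(productIds = []) {
  const pipeline = [];
  if (productIds.length > 0) {
    // Narrow down to the requested products before unwinding anything.
    pipeline.push({ $match: { _id: { $in: productIds } } });
  }
  pipeline.push(
    { $unwind: "$Comments" },
    { $group: { _id: "$_id", average: { $avg: "$Comments.Rating" } } }
  );
  return pipeline;
}

const allProducts = buildAverageRatingPipeline();
const oneProduct = buildAverageRatingPipeline(["0123456789012"]);
```

The resulting array is what you would hand to the model's aggregate() call.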
The underlying question that it would be useful to understand is why you don't think that finding all of the ratings, summing them up, and dividing by the total number is a useful approach. Understanding the underlying reason would help drive a better solution.
Based on the comments below, it sounds like your main concern is performance and the need to run map-reduce (or another aggregation framework) each time a user wants to see total ratings.
This person addressed a similar issue here: http://markembling.info/2010/11/using-map-reduce-in-a-mongodb-app
The solution they identified was to separate out the execution of the map-reduce function from the need in the view to see the total value. In this case, the optimal solution would be to run the map-reduce periodically and store the results in another collection, and have the average rating based on the collection that stores the averages, rather than doing the calculation in real-time each time.
As I mentioned in the previous version of this answer, you can improve performance further by limiting the map-reduce to addressing ratings that were created or updated more recently, or since the last map-reduce aggregation.

Mongoose/Mongodb previous and next in embedded document

I'm learning Mongodb/Mongoose/Express and have come across a fairly complex query (relative to my current level of understanding anyway) that I'm not sure how best to approach. I have a collection - to keep it simple let's call it entities - with an embedded actions array:
name: String
actions: [{
name: String
date: Date
}]
What I'd like to do is to return an array of documents with each containing the most recent action (or most recent to a specified date), and the next action (based on the same date).
Would this be possible with one find() query, or would I need to break this down into multiple queries and merge the results to generate one result array? I'm looking for the most efficient route possible.
Provided that your "actions" are inserted with the "most recent" being the last entry in the list, and usually this will be the case unless you are specifically updating items and changing dates, then all you really want to do is "project" the last item of the array. This is what the $slice projection operation is for:
Model.find({},{ "actions": { "$slice": -1 } },function(err,docs) {
// contains an array with the last item
});
If indeed you are "updating" array items and changing dates, but you want to query for the most recent on a regular basis, then you are probably best off keeping the array ordered. You can do this with a few modifiers such as:
Model.update(
{
"_id": ObjectId("541f7bbb699e6dd5a7caf2d6"),
},
{
"$push": { "actions": { "$each": [], "$sort": { "date": 1 } } }
},
function(err,numAffected) {
}
);
Which is actually more of a trick that you can do with the $sort modifier to simply sort the existing array elements without adding or removing. In versions prior to 2.6 you need the $slice "update" modifier in here as well, but this could be set to a value larger than the expected array elements if you did not actually want to restrict the possible size, but that is probably a good idea.
Unfortunately, if you were "updating" via a $set statement, then you cannot do this "sorting" in a single update statement, as MongoDB will not allow both types of operations on the array at once. But if you can live with that, then this is a way to keep the array ordered so the first query form works.
If it just seems too hard to keep an array ordered by date, then you can in fact retrieve the largest value by means of the .aggregate() method. This allows greater manipulation of the documents than is available to basic queries, at a little more cost:
Model.aggregate([
// Unwind the array to de-normalize as documents
{ "$unwind": "$actions" },
// Sort the contents per document _id and inner date
{ "$sort": { "_id": 1, "actions.date": 1 } },
// Group back with the "last" element only
{ "$group": {
"_id": "$_id",
"name": { "$last": "$name" },
"actions": { "$last": "$actions" }
}}
],
function(err,docs) {
})
And that will "pull apart" the array using the $unwind operator, then process with a next stage to $sort the contents by "date". In the $group pipeline stage the "_id" means to use the original document key to "collect" on, and the $last operator picks the field values from the "last" document ( de-normalized ) on that grouping boundary.
So there are various things that you can do, but of course the best way is to keep your array ordered and use the basic projection operators to simply get the last item in the list.
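The queries above only return the most recent action. For the "previous and next relative to a date" part of the question, one option (a sketch, not something the queries above do for you) is to fetch the actions array and split it in application code. The helper name `previousAndNext` is invented, and each action is assumed to carry a JavaScript Date in its date field:

```javascript
// Returns the latest action at or before `pivot` and the earliest one after it.
function previousAndNext(actions, pivot = new Date()) {
  const sorted = [...actions].sort((a, b) => a.date - b.date);
  let previous = null;
  let next = null;
  for (const action of sorted) {
    if (action.date <= pivot) {
      previous = action; // keep overwriting: the last one wins
    } else {
      next = action; // first action strictly after the pivot
      break;
    }
  }
  return { previous, next };
}

const sampleActions = [
  { name: "ship", date: new Date("2024-03-20T00:00:00Z") },
  { name: "order", date: new Date("2024-03-01T00:00:00Z") },
  { name: "pack", date: new Date("2024-03-10T00:00:00Z") },
];
const around = previousAndNext(sampleActions, new Date("2024-03-12T00:00:00Z"));
// around.previous.name is "pack", around.next.name is "ship"
```

This costs one pass over each array on the client, which is fine for small arrays; for large ones, keeping the array sorted as described above remains the better route.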

Resources