MongoDB aggregation - geoNear and text search in joined collection - node.js

I have a tricky query that stretches my MongoDB know-how. Here is the simplified scenario.
We have a collection Restaurant and a collection Subsidary.
They look roughly like this (simplified - using mongoose):
const restaurantSchema = new Schema(
{
name: { type: String, required: true },
categories: { type: [String], required: true },
...
})
const subsidarySchema = new Schema(
{
restaurant: { type: Schema.Types.ObjectId, ref: 'Restaurant' },
location: {
type: { type: String, enum: ['Point'], required: true },
coordinates: { type: [Number], required: true },
},
...
})
What is required:
Always: Find restaurants that have a subsidary within a 3.5 km radius and sort them by distance.
Sometimes: Also filter those restaurants by a string that should fuzzy-match the restaurant name.
Apply further filters and pagination (e.g. filter by categories, ...).
I'm trying to tackle this with a MongoDB aggregation. The problem:
The $geoNear stage and a $match stage using $text must each be the first stage of the pipeline, which means they exclude each other.
Here are my thoughts so far:
1. Start the aggregation with subsidary, with the $geoNear stage first. This already cuts away all restaurants outside the 3.5 km radius.
2. $group the subsidaries by restaurant and keep the minimal distance value per cluster.
3. $lookup to get the matching restaurant for each cluster. Maybe $unwind here.
4. ??? Here the text/search match should be, fuzzy-matching the restaurants' names. ???
5. $match for other values (category, openingHours, ...)
6. $sort, $skip and $limit for sorting and pagination.
Here is the same as an illustration.
Question
Does this approach make sense? What would be a possible way to implement stage 4?
I searched a lot, but there seems to be no way to use something like { $match: { $text: { $search: req.query.name } } } as a 4th stage.
An alternative would be to run a second query beforehand that handles only the text search and then build an intersection. This could lead to a massive number of restaurant IDs being passed into that stage. Is that something MongoDB could handle?
I'm very thankful for your comments!

Some ways around the requirement that both text search and geo query must be the first stage:
Use text search as the first stage, then manually calculate the distance using $set/$expr in a subsequent stage.
Use geo query as the first stage, then perform text filtering in your application (allowing you also to use any text matching/similarity algorithm you like).
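The second option could look roughly like the sketch below. It assumes a 2dsphere index on the subsidary `location` field, that mongoose stored the restaurants in a collection named `restaurants`, and that a small edit-distance check is an acceptable stand-in for "fuzzy" matching. The helper names `buildNearbyPipeline`, `levenshtein` and `fuzzyNameMatch` are hypothetical and not part of MongoDB or mongoose.

```javascript
// Geo-first pipeline, meant to run on the Subsidary model.
// Assumption: restaurants live in a collection named 'restaurants'.
function buildNearbyPipeline(longitude, latitude) {
  return [
    { $geoNear: {
      near: { type: 'Point', coordinates: [longitude, latitude] },
      distanceField: 'distance',
      maxDistance: 3500, // metres, i.e. the 3.5 km radius
      spherical: true,
    } },
    // One document per restaurant, keeping the distance of its closest subsidary.
    { $group: { _id: '$restaurant', distance: { $min: '$distance' } } },
    { $lookup: { from: 'restaurants', localField: '_id', foreignField: '_id', as: 'restaurant' } },
    { $unwind: '$restaurant' },
    { $sort: { distance: 1 } },
  ];
}

// Single-row Levenshtein edit distance between two strings.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      row[j] = Math.min(
        row[j] + 1, // deletion
        row[j - 1] + 1, // insertion
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      diagonal = above;
    }
  }
  return row[b.length];
}

// True when the query is a substring of the name, or is within
// `maxEdits` edits of one of the words in the name.
function fuzzyNameMatch(name, query, maxEdits = 2) {
  const needle = query.toLowerCase();
  const haystack = name.toLowerCase();
  if (haystack.includes(needle)) return true;
  return haystack.split(/\s+/).some((word) => levenshtein(word, needle) <= maxEdits);
}

// Usage (not executed here):
// const rows = await Subsidary.aggregate(buildNearbyPipeline(lng, lat));
// const hits = rows.filter((r) => fuzzyNameMatch(r.restaurant.name, req.query.name));
```

With this split, pagination would also have to happen in the application, after the name filter has been applied.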

Related

Specific aggregated query using mongoose / mongodb

I need help with the following task.
I want to query all documents of type TokenBalance from a MongoDB database using mongoose. The schema looks like this:
const Schema = mongoose.Schema(
{
address : { type: String, index: true },
ethervalue: {type: Number, required:true},
balances : []
},
{ timestamps: true }
);
The balances [] Array inside that schema holds multiple objects of this structure
{address: addresshash, symbol: someSymbol, balance: somebalance, usdvalue: usdvalueOfBalance}
What I need is to query all documents of type TokenBalance within a given timespan and then, across all of these documents, sum the usdvalues grouped by symbol.
For example the output I need should look like this:
[
{symbol: BTC, balance: 100.000},
{symbol: ETC, balance: 120.000}
...
]
I'm having difficulties writing the correct aggregation for this. In particular, I don't know how to group by the symbols inside each document's "balances" array and sum the values.
I hope my question is clear and someone could help me.
Thank you in advance
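One possible shape for such an aggregation is sketched below. It assumes the timespan refers to the `createdAt` field that `timestamps: true` adds, and `buildUsdValueBySymbolPipeline` is a hypothetical helper name; the stages themselves ($match, $unwind, $group, $project) are standard aggregation stages.

```javascript
// Builds a pipeline that sums usdvalue per symbol across all documents
// created in [from, to). Assumption: the timespan is measured on createdAt.
function buildUsdValueBySymbolPipeline(from, to) {
  return [
    { $match: { createdAt: { $gte: from, $lt: to } } },
    // One document per entry of the balances array.
    { $unwind: '$balances' },
    // Group the unwound entries by symbol and add up their usdvalue.
    { $group: { _id: '$balances.symbol', balance: { $sum: '$balances.usdvalue' } } },
    // Reshape to { symbol, balance } as in the desired output.
    { $project: { _id: 0, symbol: '$_id', balance: 1 } },
  ];
}

// Usage (not executed here; TokenBalance is the mongoose model for the schema above):
// const totals = await TokenBalance.aggregate(
//   buildUsdValueBySymbolPipeline(new Date('2021-01-01'), new Date('2021-02-01'))
// );
```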

How to sort, index and paginate posts mongodb-mongoose

I have the following postSchema and would like to fetch data depending on the updatedAt field. When people comment, I increase numberOfReply by one and updatedAt is updated. How should I fetch data for infinite scroll, and should I use indexing for this operation?
const postScheme = mongoose.Schema(
{
post: {
type: String,
trim: true,
},
numberOfReply: {
type: Number,
default: 0
},
owner: {
type: mongoose.Schema.Types.ObjectId,
ref: 'User'
},
hasImage: {
type: Boolean,
},
image: {
type: String,
trim: true
},
},
{timestamps: true}
)
This is what I use to fetch the first page:
Post.Post.find({}).sort({'updatedAt': -1}).limit(10).populate('owner').populate('coin').exec(function (err, posts) {
res.send(posts)
})
This is for infinite scroll:
Post.Post.find({isCoin: true, updatedAt: {$lt: req.body.last}}).sort({'updatedAt': -1}).populate('owner').limit(10).exec(function (err, posts) {
res.send(posts)
})
The limit and skip syntax is MongoDB's way of paginating through data, and your range condition on updatedAt does the same job, so you have that worked out. From a code perspective there is not much you can change to make it work better.
should I use indexing for this operation
Most definitely yes. Indexes are the way to make this operation efficient; otherwise MongoDB will do a collection scan for each page, which is very inefficient.
So what kind of index should you build? You want a compound index that lets the query satisfy both the filter and the sort conditions, and in your case that is on the isCoin and updatedAt fields, like so:
db.collection.createIndex( { isCoin: 1, updatedAt: -1 } )
A few improvements you can make to make the index a bit more efficient (for this specific query) are:
Consider creating the index as a sparse index, which leaves out documents that contain none of the indexed fields. Obviously, if every document has these fields you can ignore this option.
This one has a few caveats, but partial indexes are designed for this case: they improve query performance by indexing a smaller subset of the data. In your case you can add this option:
{ partialFilterExpression: { isCoin: true } }
With that said, this will limit the index's usefulness for other queries, so it might not be the ultimate choice for you.
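As a rough sketch of how the query and the index fit together, the snippet below keeps the filter, sort and index definition side by side. `buildPageQuery` and `feedIndex` are hypothetical names introduced for illustration; the field names come from the question.

```javascript
// Builds the arguments for one page of the feed.
// `last` is the updatedAt value of the final post on the previous page;
// leave it undefined for the first page.
function buildPageQuery(last, pageSize = 10) {
  const filter = { isCoin: true };
  if (last !== undefined) {
    filter.updatedAt = { $lt: new Date(last) };
  }
  return { filter, sort: { updatedAt: -1 }, limit: pageSize };
}

// Index definition that matches the filter and the sort above.
const feedIndex = {
  keys: { isCoin: 1, updatedAt: -1 },
  options: { partialFilterExpression: { isCoin: true } },
};

// Usage with the schema and model from the question (not executed here):
// postScheme.index(feedIndex.keys, feedIndex.options);
// const { filter, sort, limit } = buildPageQuery(req.body.last);
// Post.Post.find(filter).sort(sort).limit(limit).populate('owner').exec(callback);
```

Keeping the equality field (`isCoin`) first and the sort field (`updatedAt`) second is what lets a single index serve both the filter and the sort.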

Near Geometry with a Join

I'm trying to fetch results from my MongoDB server. The query: get the cars that are in the nearest agency.
This is what I have tried, but I am getting results without sorting:
let cars = await Cars.find({disponible: true})
.populate({
path: 'agency',
match: {
"location": {
$near: {
$geometry: {
coordinates: [ latitude , longitude ]
},
}
}
},
select: 'name'
})
.select('name agency');
// send result via api
res.status(200).json({cars})
My schemas:
//Car Schema
const carSchema = new Schema({
name: { type: String, required: true},
agency: {type: Schema.Types.ObjectId, ref: 'agencies'},
}, { timestamps: true });
//Agency Schema
const agencySchema = new Schema({
name: { type: String, required: true},
location: {
type: {
type: String,
enum: ['Point'],
default: 'Point'
},
coordinates: {
type: [Number],
required: true
}
},
}, { timestamps: true });
I want to get cars with their agency, sorted by the nearest agency.
There's a reason populate() cannot work
Using populate() you won't be able to do this, and for a number of reasons. The main reason being that all populate() is doing is essentially marrying up your foreign reference to results from another collection with given query parameters.
In fact with a $near query, the results could be quite weird, since you might not receive enough "near" results to actually marry up with all the parent references.
There's a bit more detail about the "foreign constraint" limitation with populate() in existing answers to Querying after populate in Mongoose and of course on the modern solution to this, which is $lookup.
Using $lookup and $geoNear
In fact, what you need is a $lookup along with a $geoNear, but you also must do the "join" the other way around to what you might expect. And thus from the Agency model you would do:
Agency.aggregate([
// First find "near" agencies, and project a distance field
{ "$geoNear": {
"near": {
"type": "Point",
"coordinates": [ longitude , latitude ]
},
"distanceField": "distance",
"spherical": true
}},
// Then marry these up to Cars - which can be many
{ "$lookup": {
"from": Car.collection.name,
"let": { "agencyId": "$_id" },
"pipeline": [
{ "$match": {
"disponible": true,
"$expr": { "$eq": [ "$$agencyId", "$agency" ] }
}}
],
"as": "cars"
}},
// Unwinding denormalizes that "many"
{ "$unwind": "$cars" },
// Group is "inverting" the result
{ "$group": {
"_id": "$cars._id",
"car": { "$first": "$cars" },
"agency": {
"$first": {
"$arrayToObject": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"cond": { "$ne": [ "$$this.k", "cars" ] }
}
}
}
}
}},
// Sort by distance, nearest is least
{ "$sort": { "agency.distance": 1 } },
// Reformat to expected output
{ "$replaceRoot": {
"newRoot": {
"$mergeObjects": [ "$car", { "agency": "$agency" } ]
}
}}
])
As stated the $geoNear part must come first. Bottom line is it basically needs to be the very first stage in an aggregation pipeline in order to use the mandatory index for such a query. Though it is true that given the form of $lookup shown here you "could" actually use a $near expression within the $lookup pipeline with a starting $match stage, it won't return what you expect since basically the constraint is already on the matching _id value. And it's really just the same problem populate() has in that regard.
And of course though $geoNear has a "query" constraint, you cannot use $expr within that option so this rules out that stage being used inside the $lookup pipeline again. And yes, still basically the same problem of conflicting constraints.
So this means you $geoNear from your Agency model instead. This pipeline stage has the additional thing it does which is it actually projects a "distanceField" into the result documents. So a new field within the documents ( called "distance" in the example ) will then indicate how far away from the queried point the matched document is. This is important for sorting later.
Of course you want this "joined" to the Car, so you want to do a $lookup. Note that since MongoDB has no knowledge of mongoose models the $lookup pipeline stage expects the "from" to be the actual collection name on the server. Mongoose models typically abstract this detail away from you ( though it's normally the plural of the model name, in lowercase ), but you can always access this from the .collection.name property on the model as shown.
The other arguments are the "let" in which you keep a reference to the _id of the current Agency document. This is used within the $expr of the $match in order to compare the local and foreign keys for the actual "joining" condition. The other constraints in the $match further filter down the matching "cars" to those criteria as well.
Now it's likely there are in fact many cars to each agency and that is one basic reason the model has been done like this in separate collections. Regardless of whether it's one to one or one to many, the $lookup result always produces an array. Basically we now want this array to "denormalize" and essentially "copy" the Agency detail for each found Car. This is where $unwind comes in. An added benefit is that when you $unwind the array of matching "cars", any empty array where the constraints did not match anything effectively removes the Agency from the possible results altogether.
Of course this is the wrong way around from how you actually want the results, as it's really just "one car" with "one agency". This is where $group comes in and collects information "per car". Since this way around it is expected as "one to one", the $first operator is used as an accumulator.
There is a fancy expression in there with $objectToArray and $arrayToObject, but really all that is doing is removing the "cars" field from the "agency" content, just as the "$first": "$cars" is keeping that data separate.
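For readers who find the operator chain hard to parse, a plain-JavaScript analogue of that expression is sketched here. `withoutCars` is a hypothetical helper and the sample document is invented purely for illustration.

```javascript
// Analogue of $objectToArray -> $filter -> $arrayToObject:
// turn the document into [key, value] pairs, drop the "cars" pair,
// and rebuild an object from what is left.
function withoutCars(agencyDoc) {
  return Object.fromEntries(
    Object.entries(agencyDoc).filter(([key]) => key !== 'cars')
  );
}

// Invented example of an unwound Agency document.
const unwound = {
  _id: 'a1',
  name: 'Downtown',
  distance: 120.5,
  cars: { _id: 'c1', name: 'Clio' },
};

const agencyOnly = withoutCars(unwound);
console.log(agencyOnly);
```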
Back to something closer to the desired output, the other main thing is to $sort the results so the "nearest" results are the ones listed first, just as the initial goal was all along. This is where you actually use the "distance" value which was added to the document in the original $geoNear stage.
At this point you are nearly there, and all that is needed is to reform the document into the expected output shape. The final $replaceRoot does this by taking the "car" value from the earlier $group output and promoting it to the top level object to return, and "merging" in the "agency" field to appear as part of the Car itself. Clearly $mergeObjects does the actual "merging".
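In plain JavaScript the effect of that final reshaping is roughly an object spread, as the sketch below shows. `promoteCar` is a hypothetical helper and the grouped document is an invented example.

```javascript
// Analogue of $replaceRoot with $mergeObjects: [ "$car", { agency: "$agency" } ].
// Fields from later objects win, so the agency reference stored on the car
// is replaced by the full agency detail.
function promoteCar(grouped) {
  return { ...grouped.car, agency: grouped.agency };
}

// Invented example of one document as it leaves the $group stage.
const groupedDoc = {
  _id: 'c1',
  car: { _id: 'c1', name: 'Clio', agency: 'a1' },
  agency: { _id: 'a1', name: 'Downtown', distance: 120.5 },
};

const reshaped = promoteCar(groupedDoc);
console.log(reshaped);
```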
That's it. It does work, but you may have spotted the problem that you don't actually get to say "near to this AND with this other constraint" technically as part of a single query. And a funny thing about "nearest" results is they do have an in-built "limit" on results they should return.
And that is basically in the next topic to discuss.
Changing the Model
Whilst all the above is fine, it's still not really perfect and has a few problems. The most notable problem should be that it's quite complex and that "joins" in general are not good for performance.
The other considerable flaw is that as you might have gathered from the "query" parameter on the $geoNear stage, you are not really getting the equivalent of both conditions ( find nearest agency to AND car has disponible: true ) since on separate collections the initial "near" does not consider the other constraint.
Nor can this even be done from the original order just as was intended, and again comes back to the problem with populate() here.
So the real issue unfortunately is design. And it may be a difficult pill to swallow, but the current design which is extremely "relational" in nature is simply not a good fit for MongoDB in how it would handle this type of operation.
The core problem is the "join", and in order to make things work we basically need to get rid of it. And you do that in MongoDB design by embedding the document instead of keeping a reference in another collection:
const carSchema = new Schema({
name: { type: String, required: true},
agency: {
name: { type: String, required: true},
location: {
type: {
type: String,
enum: ['Point'],
default: 'Point'
},
coordinates: {
type: [Number],
required: true
}
}
}
}, { timestamps: true });
In short "MongoDB is NOT a relational database", and it also does not really "do joins" as the sort of integral constraint over a join you are looking for simply is not supported.
Well, it's not supported by $lookup and the ways it will do things, but the official line has been and will always be that a "real join" in MongoDB is embedded detail. Which simply means "if it's meant to be a constraint on queries you want to do, then it belongs in the same document".
With that redesign the query simply becomes:
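// $near with $geometry requires a 2dsphere index on "agency.location"
Car.find({
disponible: true,
"agency.location": {
$near: {
$geometry: {
type: "Point", // GeoJSON needs an explicit type
coordinates: [ longitude , latitude ] // GeoJSON order is longitude first
},
}
}
})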
YES, that would mean that you likely duplicate a lot of information about an "agency" since the same data would likely be present on many cars. But the facts are that for this type of query usage, this is actually what MongoDB is expecting you to model as.
Conclusion
So the real choices here come down to which case suits your needs:
Accept that you are possibly returning fewer results than expected due to "double filtering" through the use of a $geoNear and $lookup combination. Note that before MongoDB 4.2 $geoNear returns only 100 results by default, unless you change that. This can be an unreliable combination for "paged" results.
Restructure your data accepting the "duplication" of agency detail in order to get a proper "dual constraint" query since both criteria are in the same collection. It's more storage and maintenance, but it is more performant and completely reliable for "paged" results.
And of course if neither the aggregation approach shown nor the restructure of data is acceptable, then this can only show that MongoDB is probably not best suited to this type of problem, and you would be better off using an RDBMS where you decide you must keep normalized data as well as be able to query with both constraints in the same operation. Provided of course you can choose an RDBMS which actually supports the usage of such GeoSpatial queries along with "joins".

Aggregate and flatten an array field in MongoDB

I have a Schema:
var ProjectSchema = new Schema({
name: {
type: String,
default: ''
},
topics: [{
type: Schema.ObjectId,
ref: 'Topic'
}],
user: {
type: Schema.ObjectId,
ref: 'User'
}
});
What I want to do is get an array with all topics from all projects. I cannot query Topic directly and get a full list because some topics are unassigned and they do not hold a reference back to a Project (for reasons of avoiding two way references). So I need to query Project and aggregate somehow. I am doing something like:
Project.aggregate([{$project:{topics:1}}]);
But this is giving me an array of Project objects with the topics field. What I want is an array with topic objects.
How can I do this?
When dealing with arrays you typically want to use $unwind on the array members first and then $group to find the distinct entries. Since topics holds the ObjectId references directly, the field path is simply "topics":
Project.aggregate(
[
{ "$unwind": "$topics" },
{ "$group": { "_id": "$topics" } }
],
function(err,docs) {
// docs: one document per distinct topic reference, as { _id: <ObjectId> }
}
)
But for this case, it is probably simpler to just use .distinct(), which does the same as above but returns a plain array of values rather than documents:
Project.distinct("topics",function(err,topics) {
// topics: array of distinct ObjectId values
});
But wait a minute, because I know what you are really asking here. It's not the _id values you want; your Topic data has a property on it like "name".
Since your items are "referenced" and in another collection, you cannot do an aggregation pipeline or .distinct() operation on the property of a document in another collection. Put basically "MongoDB does not perform Joins" and mongoose .populate() is not a join, just something that "emulates" that with additional query(ies).
But you can of course just find the "distinct" values from "Project" and then fetch the information from "Topic". As in:
Project.distinct("topics",function(err,topicIds) {
Topic.find({ "_id": { "$in": topicIds } },function(err,topics) {
// topics: the full Topic documents that are assigned to at least one Project
});
});
Which is handy because the .distinct() function already returned an array suitable for use with $in.
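On servers where the $lookup stage is available (MongoDB 3.2 and later, with $replaceRoot from 3.4), the same two steps could probably be folded into one pipeline. The sketch below is only an illustration under that assumption; `buildAssignedTopicsPipeline` is a hypothetical helper, and the default collection name `topics` is a guess at mongoose's pluralisation of the Topic model.

```javascript
// Builds a pipeline for the Project collection that returns the distinct
// Topic documents referenced by at least one project.
function buildAssignedTopicsPipeline(topicCollection = 'topics') {
  return [
    // One document per topic reference.
    { $unwind: '$topics' },
    // Collapse duplicates so each referenced topic appears once.
    { $group: { _id: '$topics' } },
    // Fetch the matching Topic document for each distinct reference.
    { $lookup: { from: topicCollection, localField: '_id', foreignField: '_id', as: 'topic' } },
    { $unwind: '$topic' },
    // Return the Topic document itself rather than a wrapper.
    { $replaceRoot: { newRoot: '$topic' } },
  ];
}

// Usage (not executed here):
// Project.aggregate(buildAssignedTopicsPipeline(Topic.collection.name))
```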

How to calculate Rating in my MongoDB design

I'm creating a system in which users can write a review about an item and rate it from 0 to 5. I'm using MongoDB for this. My problem is finding the best way to calculate the total rating in the product schema. I don't think querying all comments to get their count and dividing the total rating by it is a good solution. Here is my schema. I appreciate any advice:
Comments:
var commentSchema = new Schema({
Rating : { type: Number, default:0 },
Helpful : { type: Number, default:0 },
User :{
type: Schema.ObjectId,
ref: 'users'
},
Content: String,
});
Here is my Item schema:
var productSchema = new Schema({
//id is barcode
_id : String,
Rating : { type: Number, default:0 },
Comments :[
{
type: Schema.ObjectId,
ref: 'comments'
}
],
});
EDIT: HERE is the solution I got from another topic : calculating average in Mongoose
You can get the total using the aggregation framework. First you use the $unwind operator to turn the comments into a document stream:
{ $unwind: "$Comments" }
The result is that each product-document is turned into one product-document per entry in its Comments array. That comment-entry becomes a single object under the field Comments; all other fields are taken from the originating product-document.
Then you use $group to rejoin the documents for each product by their _id, while you use the $avg operator to calculate the average of the rating-field:
{ $group: {
_id: "$_id",
average: { $avg: "$Comments.Rating" }
} }
Putting those two steps into an aggregation pipeline calculates the average rating for every product in your collection. You might want to narrow it down to one or a small subset of products, depending on what the user requested right now. To do this, prepend the pipeline with a $match step. The $match object works just like the one you pass to find().
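Put together, the three stages might be assembled as in the sketch below. It assumes, as the stages above do, that the comment data (including `Rating`) is embedded in the `Comments` array; with the ObjectId references from the question's schema, a $lookup of the comments would be needed before $unwind. `buildAverageRatingPipeline` is a hypothetical helper name.

```javascript
// Builds the average-rating pipeline for a given set of product ids
// (barcodes, since _id is a String in the product schema).
function buildAverageRatingPipeline(productIds) {
  return [
    // Narrow the work down to the requested products first.
    { $match: { _id: { $in: productIds } } },
    // One document per comment.
    { $unwind: '$Comments' },
    // Regroup by product and average the ratings.
    { $group: { _id: '$_id', average: { $avg: '$Comments.Rating' } } },
  ];
}

// Usage (not executed here; Product is assumed to be the model for productSchema):
// Product.aggregate(buildAverageRatingPipeline(['4006381333931']))
```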
The underlying question that it would be useful to understand is why you don't think that finding all of the ratings, summing them up, and dividing by the total number is a useful approach. Understanding the underlying reason would help drive a better solution.
Based on the comments below, it sounds like your main concern is performance and the need to run map-reduce (or another aggregation framework) each time a user wants to see total ratings.
This person addressed a similar issue here: http://markembling.info/2010/11/using-map-reduce-in-a-mongodb-app
The solution they identified was to separate out the execution of the map-reduce function from the need in the view to see the total value. In this case, the optimal solution would be to run the map-reduce periodically and store the results in another collection, and have the average rating based on the collection that stores the averages, rather than doing the calculation in real-time each time.
As I mentioned in the previous version of this answer, you can improve performance further by limiting the map-reduce to addressing ratings that were created or updated more recently, or since the last map-reduce aggregation.
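The idea of only processing ratings added since the last run can be illustrated without map-reduce at all. The sketch below assumes a stored summary of the form `{ sum, count }` per product, which is a hypothetical shape not taken from the question, and it only covers newly created ratings; edited ratings would need their old value subtracted first.

```javascript
// Folds newly created ratings into a previously stored summary,
// so older comments never have to be read again.
function mergeRatings(stored, newRatings) {
  const sum = newRatings.reduce((total, rating) => total + rating, stored.sum);
  const count = stored.count + newRatings.length;
  return { sum, count, average: count === 0 ? 0 : sum / count };
}

// Example: two ratings were summarised earlier (5 and 3),
// and two more (5 and 2) arrived since the last run.
const updated = mergeRatings({ sum: 8, count: 2 }, [5, 2]);
console.log(updated.average);
```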
