I have to implement search using search indexes in MongoDB Atlas, as well as a normal browse feature. This includes filtering, match, sort, skip and limit (pagination). I have built an aggregation pipeline to achieve all of this.
First I push the search stage onto my pipeline, then the match, then the sort, then the skip and finally the limit stage.
Here's how it goes:
const query = [];
query.push({
  $search: {
    index: 'default',
    text: {
      query: searchQuery,
      path: { }
    }
  }
});
query.push({
  $sort: sort,
});
query.push({
  $match: {
    type: match
  }
});
query.push({
  $skip: skip
});
query.push({
  $limit: perPage
});
let documents = await collection.aggregate(query);
The results I get so far are correct. However, for pagination I also want to get the total count of matching documents. The count must take the "match" and the "searchQuery" (if any) into account.
I have tried $facet but it gives the error: $_internalSearchMongotRemote is not allowed to be used within a $facet stage.
So, I see a few challenges with this query.
The $sort stage may not be needed. All search queries are sorted by relevance score by default. If you need to sort on some other criterion, then it may be appropriate.
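If you do need such a secondary sort, it stays a plain stage after $search, e.g. (the field name here is purely illustrative):
query.push({ $sort: { createdAt: -1 } });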
The $match stage is probably not needed either. What most people are looking for when they try to match is a compound filter. As you can see from the docs [1], a filter behaves like a $match in ordinary MongoDB.
If you would like a fast count of the documents returned, you can use the new count operator. It is available on 4.4.11 and 5.0.4 clusters; you can read about it here [2]. Putting it all together, your query should probably look something like:
const query = [];
query.push({
  $search: {
    index: 'default',
    compound: {
      filter: [{
        text: {
          query: match,
          path: "type"
        }
      }],
      must: [{
        text: {
          query: searchQuery,
          path: { }
        }
      }]
    },
    count: {
      type: "total"
    }
  }
});
query.push({
  $skip: skip
});
query.push({
  $limit: perPage
});
/* lets you use the count; you can obviously project more fields */
query.push({
  $project: { meta: "$$SEARCH_META" }
});
let documents = await collection.aggregate(query);
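If you want both the page of documents and the total in the same round trip, here is a minimal sketch (it assumes the final $project above is swapped for an $addFields so the document fields survive; the field name meta is illustrative):
// attach the search metadata to every returned document instead of projecting it away
query.push({ $addFields: { meta: "$$SEARCH_META" } });

const documents = await collection.aggregate(query).toArray();
// count: { type: "total" } in the $search stage makes count.total available here
const total = documents.length ? documents[0].meta.count.total : 0;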
[1]: https://docs.atlas.mongodb.com/reference/atlas-search/compound/#mongodb-data-filter
[2]: https://docs.atlas.mongodb.com/reference/atlas-search/counting/#std-label-count-ref
I know there are a LOT of questions regarding this subject, and while most of the answers work, they perform really poorly when there are millions of records.
I have a collection with 10,000,000 records.
At first I was using mongoose-paginate-v2 and it took around 8 s to get each page with no filtering, and 25 s when filtering. That is fairly decent compared to the other answers I found googling around. Then I read about aggregate (in another question on the same subject) and it was a marvel: 7 ms to get each page without filtering, no matter which page it is:
const pageSize = +req.query.pagesize;
const currentPage = +req.query.currentpage;
let recordCount;
ServiceClass.find().count().then((count) => {
  recordCount = count;
  ServiceClass.aggregate().skip(currentPage).limit(pageSize).exec().then((documents) => {
    res.status(200).json({
      message: msgGettingRecordsSuccess,
      serviceClasses: documents,
      count: recordCount,
    });
  })
  .catch((error) => {
    res.status(500).json({ message: msgGettingRecordsError });
  });
}).catch((error) => {
  res.status(500).json({ message: "Error getting record count" });
});
What I'm having issues with is the filtering. aggregate doesn't really work like find, so my conditions are not working. I read the docs about aggregate and, as a start, tried [ {$match: {description: {$regex: regex}}} ] inside aggregate, but it did not return anything.
This is my current working function for filtering and pagination (which takes 25s):
const pageSize = +req.query.pagesize;
const currentPage = +req.query.currentpage;
const filter = req.params.filter;
const regex = new RegExp(filter, 'i');
ServiceClass.paginate({
  $or: [
    { code: { $regex: regex } },
    { description: { $regex: regex } },
  ]
}, { limit: pageSize, page: currentPage }).then((documents) => {
  res.status(200).json({
    message: msgGettingRecordsSuccess,
    serviceClasses: documents
  });
}).catch((error) => {
  res.status(500).json({ message: "Error getting the records." });
});
code and description are both indexed: code has a unique index and description just a normal one. I need to search for documents which contain a given string in either the code or the description field.
What is the most efficient way to filter and paginate when you have millions of records?
The code below will get the paginated result from the database along with the total count of documents for that particular query, simultaneously.
const pageSize = +req.query.pagesize;
const currentPage = +req.query.currentpage;
const regex = new RegExp(req.params.filter, 'i');
const skip = currentPage * pageSize - pageSize;
const query = [
  {
    $match: { $or: [{ code: { $regex: regex } }, { description: { $regex: regex } }] },
  },
  {
    $facet: {
      result: [
        {
          $skip: skip,
        },
        {
          $limit: pageSize,
        },
        {
          $project: {
            createdAt: 0,
            updatedAt: 0,
            __v: 0,
          },
        },
      ],
      count: [
        {
          $count: "count",
        },
      ],
    },
  },
  {
    $project: {
      result: 1,
      count: {
        $arrayElemAt: ["$count", 0],
      },
    },
  },
];
const result = await ServiceClass.aggregate(query);
console.log(result);
// result[0] is an object with `result` and `count` keys.
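A response can then be built from that first facet document, for example (a sketch reusing the variable names from the question; count is absent when nothing matches):
const { result: serviceClasses, count } = result[0];
res.status(200).json({
  message: msgGettingRecordsSuccess,
  serviceClasses,
  count: count ? count.count : 0,
});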
Hope it helps.
The most efficient way to filter and paginate when you have millions of records is to use MongoDB's built-in pagination and filtering features, such as the $skip, $limit and $match stages of the aggregate() pipeline.
You can use $skip to skip a certain number of documents and $limit to cap the number of documents returned. You can also use $match to filter the documents based on certain conditions.
To filter your documents based on the code or description field, you can use the $match operator with the $or operator, like this:
ServiceClass.aggregate([
  { $match: { $or: [{ code: { $regex: regex } }, { description: { $regex: regex } }] } },
  { $skip: (currentPage - 1) * pageSize },
  { $limit: pageSize }
])
You can also use the $text operator instead of $regex, which performs more efficiently for text search queries (it requires a text index).
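A sketch of the $text variant, assuming a text index covering both fields (names taken from the question):
// one-time setup; could also be declared in the Mongoose schema with schema.index({ code: 'text', description: 'text' })
await ServiceClass.collection.createIndex({ code: 'text', description: 'text' });

// $text is only allowed in the first $match stage of the pipeline
const documents = await ServiceClass.aggregate([
  { $match: { $text: { $search: filter } } },
  { $skip: (currentPage - 1) * pageSize },
  { $limit: pageSize }
]);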
It's also important to make sure that the relevant fields (code and description) have indexes, as that will greatly speed up the search process.
You might have to adjust the query according to your specific use case and data.
I'm trying to update documents in a collection and, as per the MongoDB syntax for the update functions, I do this:
Collection.updateMany(
{
filter1: 'filter1-val',
filter2: 'filter2-val',
filter3: 'filter3-val'
},
{ $set: { filed: 'value' } }
)
and MongoDB performs the given update operation on documents that match the query. However, it only updates documents that match every single filter in the query. My question: is there a way to perform the update on documents as long as they match at least one of the filters in the query, rather than every single one? Basically, a form of relaxed match, so that as long as a document meets one of the filters, it gets updated.
Thanks in advance; any help is greatly appreciated.
You can use $expr and $or in your search criteria.
db.collection.find({
  $expr: {
    $or: [
      { $eq: ["$filter1", "filter1-val"] },
      { $eq: ["$filter2", "filter2-val"] },
      { $eq: ["$filter3", "filter3-val"] }
    ]
  }
})
Here is a mongo playground for your reference.
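The same criteria can be used as the filter of updateMany, which is what the question asks about (a sketch reusing the update from the question):
Collection.updateMany(
  {
    $expr: {
      $or: [
        { $eq: ["$filter1", "filter1-val"] },
        { $eq: ["$filter2", "filter2-val"] },
        { $eq: ["$filter3", "filter3-val"] }
      ]
    }
  },
  { $set: { filed: 'value' } }
);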
You can use the $or operator.
Collection.updateMany(
{
$or: [
{ filter1: 'filter1-val' },
{ filter2: 'filter2-val' },
{ filter3: 'filter3-val' }
]
},
{ $set: { filed: 'value' } }
);
I have a pretty simple $lookup aggregation query like the following:
{'$lookup':
{'from': 'edge',
'localField': 'gid',
'foreignField': 'to',
'as': 'from'}}
When I run this on a match with enough documents I get the following error:
Command failed with error 4568: 'Total size of documents in edge
matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" }
}, {} ] } } exceeds maximum document size' on server
All attempts to limit the number of documents fail. allowDiskUse: true does nothing. Sending in a cursor does nothing. Adding a $limit to the aggregation also fails.
How could this be?
Then I see the error again. Where did that $match, $and and $eq come from? Is the aggregation pipeline, behind the scenes, farming out the $lookup call to another aggregation that it runs on its own, one I have no ability to provide limits for or use cursors with?
What is going on here?
As stated earlier in a comment, the error occurs because the $lookup, which by default produces a target "array" within the parent document from the results of the foreign collection, selects documents whose total size causes the parent to exceed the 16MB BSON limit.
The counter to this is to process with a $unwind that immediately follows the $lookup pipeline stage. This actually alters the behavior of $lookup such that instead of producing an array in the parent, the results are instead a "copy" of the parent for every document matched.
It works pretty much like the regular usage of $unwind, except that instead of being processed as a "separate" pipeline stage, the unwinding action is actually added to the $lookup operation itself. Ideally you also follow the $unwind with a $match condition, which likewise creates a matching argument that is added to the $lookup. You can actually see this in the explain output for the pipeline.
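Applied to the pipeline from the question, the pattern is simply the following (the $match condition here is illustrative):
[
  { "$lookup": {
    "from": "edge",
    "localField": "gid",
    "foreignField": "to",
    "as": "from"
  }},
  { "$unwind": "$from" },
  { "$match": { "from._id": { "$gte": 1, "$lte": 5 } } }
]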
The topic is actually covered (briefly) in a section of Aggregation Pipeline Optimization in the core documentation:
$lookup + $unwind Coalescence
New in version 3.2.
When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.
Best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit. Done as briefly as possible to both break and work around the BSON Limit:
const MongoClient = require('mongodb').MongoClient;
const uri = 'mongodb://localhost/test';
function data(data) {
console.log(JSON.stringify(data, undefined, 2))
}
(async function() {
let db;
try {
db = await MongoClient.connect(uri);
console.log('Cleaning....');
// Clean data
await Promise.all(
["source","edge"].map(c => db.collection(c).remove() )
);
console.log('Inserting...')
await db.collection('edge').insertMany(
Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 }))
);
await db.collection('source').insert({ _id: 1 })
console.log('Fattening up....');
await db.collection('edge').updateMany(
{},
{ $set: { data: "x".repeat(100000) } }
);
// The full pipeline. Failing test uses only the $lookup stage
let pipeline = [
{ $lookup: {
from: 'edge',
localField: '_id',
foreignField: 'gid',
as: 'results'
}},
{ $unwind: '$results' },
{ $match: { 'results._id': { $gte: 1, $lte: 5 } } },
{ $project: { 'results.data': 0 } },
{ $group: { _id: '$_id', results: { $push: '$results' } } }
];
// List and iterate each test case
let tests = [
'Failing.. Size exceeded...',
'Working.. Applied $unwind...',
'Explain output...'
];
for (let [idx, test] of Object.entries(tests)) {
console.log(test);
try {
let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline),
options = (( +idx === tests.length-1 ) ? { explain: true } : {});
await new Promise((end,error) => {
let cursor = db.collection('source').aggregate(currpipe,options);
for ( let [key, value] of Object.entries({ error, end, data }) )
cursor.on(key,value);
});
} catch(e) {
console.error(e);
}
}
} catch(e) {
console.error(e);
} finally {
db.close();
}
})();
After inserting some initial data, the listing will attempt to run an aggregate merely consisting of $lookup which will fail with the following error:
{ MongoError: Total size of documents in edge matching pipeline { $match: { $and : [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size
Which is basically telling you the BSON limit was exceeded on retrieval.
By contrast, the next attempt adds the $unwind and $match pipeline stages.
The Explain output:
{
"$lookup": {
"from": "edge",
"as": "results",
"localField": "_id",
"foreignField": "gid",
"unwinding": { // $unwind now is unwinding
"preserveNullAndEmptyArrays": false
},
"matching": { // $match now is matching
"$and": [ // and actually executed against
{ // the foreign collection
"_id": {
"$gte": 1
}
},
{
"_id": {
"$lte": 5
}
}
]
}
}
},
// $unwind and $match stages removed
{
"$project": {
"results": {
"data": false
}
}
},
{
"$group": {
"_id": "$_id",
"results": {
"$push": "$results"
}
}
}
And that result of course succeeds, because the results are no longer being placed into the parent document, so the BSON limit cannot be exceeded.
This really happens as a result of adding the $unwind alone, but the $match is included in the example to show that it too is folded into the $lookup stage, and that the overall effect is to "limit" the results returned in an effective way, since it is all done within that $lookup operation and no results other than those matching are actually returned.
By constructing the pipeline in this way, you can query for "referenced data" that would otherwise exceed the BSON limit and then, if you want, $group the results back into an array format once they have been effectively filtered by the "hidden query" that is actually being performed by $lookup.
MongoDB 3.6 and Above - Additional for "LEFT JOIN"
As all the content above notes, the BSON limit is a "hard" limit that you cannot breach, and this is generally why the $unwind is necessary as an interim step. There is, however, the limitation that the "LEFT JOIN" becomes an "INNER JOIN" by virtue of the $unwind where it cannot preserve the content. Also, even preserveNullAndEmptyArrays would negate the "coalescence" and still leave the array intact, causing the same BSON limit problem.
MongoDB 3.6 adds new syntax to $lookup that allows a "sub-pipeline" expression to be used in place of the "local" and "foreign" keys. So instead of using the "coalescence" option as demonstrated, as long as the produced array does not also breach the limit, it is possible to put conditions in that pipeline that return the array "intact", possibly with no matches, as would be indicative of a "LEFT JOIN".
The new expression would then be:
{ "$lookup": {
"from": "edge",
"let": { "gid": "$gid" },
"pipeline": [
{ "$match": {
"_id": { "$gte": 1, "$lte": 5 },
"$expr": { "$eq": [ "$$gid", "$to" ] }
}}
],
"as": "from"
}}
In fact, this is basically what MongoDB does "under the covers" with the previous syntax, since 3.6 uses $expr "internally" in order to construct the statement. The difference, of course, is that there is no "unwinding" option present in how the $lookup actually gets executed.
If no documents are actually produced as a result of the "pipeline" expression, then the target array within the master document will in fact be empty, just as a "LEFT JOIN" actually does, and this would be the normal behavior of $lookup without any other options.
However, the output array itself MUST NOT cause the document where it is being created to exceed the BSON limit. So it really is up to you to ensure that any content "matched" by the conditions stays under this limit, or the same error will persist, unless of course you actually use $unwind to effect the "INNER JOIN".
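One way to keep the produced array under control is to trim the joined documents inside the sub-pipeline itself; a sketch (the projection and the $limit value are illustrative):
{ "$lookup": {
  "from": "edge",
  "let": { "gid": "$gid" },
  "pipeline": [
    { "$match": { "$expr": { "$eq": [ "$$gid", "$to" ] } } },
    { "$project": { "data": 0 } },   // drop large fields before they land in the array
    { "$limit": 50 }                 // cap how many joined documents come back
  ],
  "as": "from"
}}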
I had the same issue with the following Node.js query, because the 'redemptions' collection has more than 400,000 documents. I am using MongoDB server 4.2 and Node.js driver 3.5.3.
db.collection('businesses').aggregate([
  {
    $lookup: { from: 'redemptions', localField: "_id", foreignField: "business._id", as: "redemptions" }
  },
  {
    $project: {
      _id: 1,
      name: 1,
      email: 1,
      "totalredemptions": { $size: "$redemptions" }
    }
  }
]);
I modified the query as below to make it work much faster.
db.collection('businesses').aggregate([
  {
    $lookup: {
      from: 'redemptions',
      let: { "businessId": "$_id" },
      pipeline: [
        { $match: { $expr: { $eq: ["$business._id", "$$businessId"] } } },
        { $group: { _id: "$_id", totalCount: { $sum: 1 } } },
        { $project: { "_id": 0, "totalCount": 1 } }
      ],
      as: "redemptions"
    }
  },
  {
    $project: {
      _id: 1,
      name: 1,
      email: 1,
      "totalredemptions": { $size: "$redemptions" }
    }
  }
]);
How can I combine the matched documents' subdocuments together and return them as one array of objects? I have tried $group but it doesn't seem to work.
My query (this returns an array of objects; in this case there are two):
User.find({
  'business_details.business_location': {
    $near: coords,
    $maxDistance: maxDistance
  },
  'deal_details.deals_expired_date': {
    $gte: new Date()
  }
}, {
  'deal_details': 1
}).limit(limit).exec(function(err, locations) {
  if (err) {
    return res.status(500).json(err)
  }
  console.log(locations)
});
The console.log(locations) result:
// give me the result below
[{
_id: 55c0b8c62fd875a93c8ff7ea, // first document
deal_details: [{
deals_location: '101.6833,3.1333',
deals_price: 12.12 // 1st deal
}, {
deals_location: '101.6833,3.1333',
deals_price: 34.3 // 2nd deal
}],
business_details: {}
}, {
_id: 55a79898e0268bc40e62cd3a, // second document
deal_details: [{
deals_location: '101.6833,3.1333',
deals_price: 12.12 // 3rd deal
}, {
deals_location: '101.6833,3.1333',
deals_price: 34.78 // 4th deal
}, {
deals_location: '101.6833,3.1333',
deals_price: 34.32 // 5th deal
}],
business_details: {}
}]
What I want to do is combine both of these deal_details fields together and return them as one array of objects. It would contain all 5 deals in a single array instead of two separate arrays of objects.
I have tried to do it in my backend (Node.js) by using concat or push; however, when there are more than 2 matched documents I have problems concatenating them together. Is there any way to combine all matched documents and return them as one, like I described above?
What you are probably missing here is the $unwind pipeline stage, which is what you typically use to "de-normalize" array content, particularly when your grouping operation intends to work across documents in your query result:
User.aggregate(
[
// Your basic query conditions
{ "$match": {
"business_details.business_location": {
"$near": coords,
"$maxDistance": maxDistance
},
"deal_details.deals_expired_date": {
"$gte": new Date()
}},
// Limit query results here
{ "$limit": limit },
// Unwind the array
{ "$unwind": "$deal_details" },
// Group on the common location
{ "$group": {
"_id": "$deal_details.deals_location",
"prices": {
"$push": "$deal_details.deals_price"
}
}}
],
function(err,results) {
if (err) throw err;
console.log(JSON.stringify(results,undefined,2));
}
);
Which gives output like:
{
"_id": "101.6833,3.1333",
"prices": [
12.12,
34.3,
12.12,
34.78,
34.32
]
}
Depending on how many documents actually match the grouping.
Alternatively, you might want to look at the $geoNear pipeline stage, which gives a bit more control, especially when dealing with content in arrays.
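For reference, a minimal $geoNear first stage might look like this (a sketch; it assumes a geospatial index on business_details.business_location, and the distanceField name is illustrative):
{ "$geoNear": {
  "near": coords,                  // legacy pair or GeoJSON point, matching your index type
  "distanceField": "distance",
  "maxDistance": maxDistance,
  "query": { "deal_details.deals_expired_date": { "$gte": new Date() } }
}}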
Also beware that with "location" data in an array, only the "nearest" result is being considered here and not "all" of the array content. So other items in the array may not be actually "near" the queried point. That is more of a design consideration though as any query operation you do will need to consider this.
You can merge them with reduce:
locations = locations.reduce(function(prev, location) {
  return prev.concat(location.deal_details);
}, []);
I have the following MongoDB query in Node.js which gives me a list of unique zip codes with a count of how many times each zip code appears in the database.
collection.aggregate([
  {
    $group: {
      _id: "$Location.Zip",
      count: { $sum: 1 }
    }
  },
  { $sort: { _id: 1 } },
  { $match: { count: { $gt: 1 } } }
], function (lookupErr, lookupData) {
  if (lookupErr) {
    res.send(lookupErr);
    return;
  }
  res.send(lookupData.sort());
});
How can this query be modified to return one specific zip code? I've tried the condition clause but have not been able to get it to work.
Aggregations that require filtered results can be done with the $match operator. Without tweaking what you already have, I would suggest just sticking in a $match for the zip code you want returned at the top of the aggregation list.
collection.aggregate([
  {
    $match: {
      "Location.Zip": 47421
    }
  },
  {
    $group: {
      ...
This example will result in every aggregation operation after the $match working only on the data set that is returned by the $match of the Location.Zip key to the value 47421.
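Put together with the pipeline from the question, it might look like this (a sketch; the zip value is illustrative):
collection.aggregate([
  { $match: { "Location.Zip": 47421 } },   // restrict to one zip code first
  { $group: { _id: "$Location.Zip", count: { $sum: 1 } } },
  { $sort: { _id: 1 } },
  { $match: { count: { $gt: 1 } } }
], function (lookupErr, lookupData) {
  if (lookupErr) {
    res.send(lookupErr);
    return;
  }
  res.send(lookupData);
});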
In the $match pipeline operator add:
{ $match: {
  count: { $gt: 1 },
  _id: "10002" // replace 10002 with the zip code you want
}}
As a side note, you should put the $match operator first and in general as high in the aggregation chain as you can.