mongodb caching with aggregation pipeline - pagination

MongoDB aggregate pipeline, something like:
db.testing.aggregate([
  {
    $match : { hosting : "aws.amazon.com" }
  },
  {
    $group : { _id : "$hosting", total : { $sum : 1 } }
  },
  {
    $project : { title : 1, author : 1, <few other transformations> }
  },
  { $sort : { total : -1 } }
]);
Now I want to enable paging. I have two options.
Option 1: Use $skip and $limit in the pipeline.
{ $skip : pageNumber * pageSize },
{ $limit : pageSize }
External API-level caching of each page would reduce the time for repeated loads of the same page, but the first load of each page will still be slow because the sort forces a scan of all matching documents.
Option 2: Handle pagination in the application.
Cache the full result, i.e. the result of List findAll();
Pagination is then handled at the service layer, and the requested page is returned.
From the next request onward you refer to the cached result and send the desired slice of records from the cache. A sketch of this option follows.
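A minimal sketch of that second option in Node.js (the cachedResults variable and getPage function are hypothetical names; any in-process or external cache would do):
// Run the pipeline once, keep the sorted result in memory, slice per page.
let cachedResults = null;

async function getPage(pageNumber, pageSize) {
  if (!cachedResults) {
    cachedResults = await db.collection('testing').aggregate([
      { $match: { hosting: 'aws.amazon.com' } },
      { $group: { _id: '$hosting', total: { $sum: 1 } } },
      { $sort: { total: -1 } }
    ]).toArray();
  }
  return cachedResults.slice(pageNumber * pageSize, (pageNumber + 1) * pageSize);
}
Remember that cachedResults has to be invalidated whenever the underlying data changes.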
Question: The 2nd approach seems better, unless the database is doing some magical optimization. My view on the 1st is that, since the pipeline involves sorting, every page request will scan the full collection, which is sub-optimal. What are your views? Which one would you choose? What is good practice (is moving some DB logic to the service layer for optimization advisable)?

It depends on your data.
MongoDB does not cache the query results in order to return the cached results for identical queries. https://docs.mongodb.com/manual/faq/fundamentals/#does-mongodb-handle-caching
However, you may create a view (from a source collection + pipeline) and update it on demand. This allows you to serve the aggregated data with good paging performance and refresh the content periodically. You may also create indexes on it for better performance (no need to build extra logic in the service layer).
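One way to implement that on-demand update (a sketch, assuming MongoDB 4.2+; the hostingStats collection name is my own) is an on-demand materialized view built with $merge and refreshed on a schedule:
// Recompute the aggregate into a separate collection; run periodically.
db.testing.aggregate([
  { $group: { _id: "$hosting", total: { $sum: 1 } } },
  { $merge: { into: "hostingStats", whenMatched: "replace" } }
]);

// Paging then becomes a cheap indexed query on the small result collection.
db.hostingStats.createIndex({ total: -1 });
db.hostingStats.find().sort({ total: -1 }).skip(pageNumber * pageSize).limit(pageSize);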
Also, since you always filter and $group by the hosting field, you may benefit from a MongoDB index by moving the $sort stage next to the $match stage. In this case, MongoDB will use the index for the filter + sort, and paging is done in memory.
db.testing.createIndex({hosting:-1})
db.testing.aggregate([
{
$match: {
hosting: "aws.amazon.com"
}
},
{
$sort: {
hosting: -1
}
},
{
$group: {
_id: "$hosting",
title: {
$first: "$title"
},
author: {
$first: "$author"
},
total: {
$sum: 1
}
}
},
{
$project: {
title: 1,
author: 1,
total: 1
}
},
{ $skip : pageNumber * pageSize },
{ $limit : pageSize }
])
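If you go this route, it is worth confirming that the plan actually uses the index for the $match + $sort (a quick check in the mongo shell; "executionStats" is optional):
db.testing.explain("executionStats").aggregate([
  { $match: { hosting: "aws.amazon.com" } },
  { $sort: { hosting: -1 } }
])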

Related

Mongo db - how to join and sort two collection with pagination

I have 2 collections:
Office -
{
_id: ObjectId(someOfficeId),
name: "some name",
..other fields
}
Documents -
{
_id: ObjectId(SomeId),
name: "Some document name",
officeId: ObjectId(someOfficeId),
...etc
}
I need to get a list of offices sorted by the count of documents that refer to each office. Pagination should also be implemented.
I tried to do this with an aggregation using $lookup:
const aggregation = [
{
$lookup: {
from: 'documents',
let: {
id: '$id'
},
pipeline: [
{
$match: {
$expr: {
$eq: ['$officeId', '$id']
},
// sent_at: {
// $gte: start,
// $lt: end,
// },
}
}
],
as: 'documents'
},
},
{ $sortByCount: "$documents" },
{ $skip: (page - 1) * limit },
{ $limit: limit },
];
But this doesn't work for me.
Any ideas how to achieve this?
P.S. I need to show offices with 0 documents, so getting offices via the documents collection doesn't work for me.
Query
You can use $lookup to join on that field, with a pipeline that groups so you count the documents of each office (instead of putting the documents into an array, because you only care about the count).
$set is there to get that count as a top-level field.
Sort using the noffices field.
You can use the skip/limit way for pagination, but if your collection is very big it will be slow, see this. Alternatively, you can paginate using the natural _id order, or retrieve more documents in each query and keep them in memory (instead of retrieving just one page's documents); a pagination sketch follows the query below.
offices.aggregate([
  {
    $lookup: {
      from: "documents",
      localField: "_id",
      foreignField: "officeId",
      pipeline: [{ $group: { _id: null, count: { $sum: 1 } } }],
      as: "noffices"
    }
  },
  {
    $set: {
      noffices: {
        $cond: [
          { $eq: ["$noffices", []] },
          0,
          { $arrayElemAt: ["$noffices.count", 0] }
        ]
      }
    }
  },
  { $sort: { noffices: -1 } }
])
As the other answer pointed out, you forgot the _ of _id, but with the above lookup you don't need the let or the $match with $expr inside the pipeline. Also, $sortByCount doesn't count the members of an array; you would need $size ($sortByCount is just group-and-count, it's not for arrays). But you don't need $size either: you can count inside the lookup pipeline, like above.
Edit
Query
you can add what you need in the pipeline, or just remove it
this keeps all the documents, and counts the array size
and then sorts
offices.aggregate([
  {
    $lookup: {
      from: "documents",
      localField: "_id",
      foreignField: "officeId",
      pipeline: [],
      as: "alldocuments"
    }
  },
  { $set: { ndocuments: { $size: "$alldocuments" } } },
  { $sort: { ndocuments: -1 } }
])
There are two errors in your lookup:
1. When passing the variable in with let, you forgot the _ of the $_id local field. It should be:
let: {
  id: '$_id'
},
2. In the $expr, since you are using a variable id and not a field of the Documents collection, you should use $$ to make reference to the variable:
$expr: {
  $eq: ['$officeId', '$$id']
},
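Putting both fixes together, the corrected stage from the question would look like this (a sketch of just the $lookup):
{
  $lookup: {
    from: 'documents',
    let: { id: '$_id' },  // the _ was missing here
    pipeline: [
      { $match: { $expr: { $eq: ['$officeId', '$$id'] } } }  // $$ references the let variable
    ],
    as: 'documents'
  }
}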

How to find the max length of an array from a set of documents present in a collection in MongoDB?

I have 'n' number of documents present inside a collection in MongoDB.
Structure of those documents is as follows:
{
"_id": "...",
"submissions": [{...}, ...]
}
I want to find the document which has the highest number of submissions out of all the documents present.
Is there any Mongo find/aggregation query which can do the same?
I don't think there is any straightforward way to achieve this.
You can try the aggregation query below:
$addFields to add a new field totalSubmissions with the number of elements in the submissions array
$sort by totalSubmissions in descending order
$limit to select the single top document
collection.aggregate([
{ $addFields: { totalSubmissions: { $size: "$submissions" } } },
{ $sort: { totalSubmissions: -1 } },
{ $limit: 1 }
])
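If you only need the maximum count rather than the whole document, a $group with $max would also work (a sketch; maxSubmissions is a field name of my choosing):
collection.aggregate([
  { $group: { _id: null, maxSubmissions: { $max: { $size: "$submissions" } } } }
])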

Speed issue and memory error when querying in Mongo

I have a collection that contains over 100,000 records. Server: node.js/express.js. DB: mongo.
On the client, a table with a pager is implemented; 10 records are requested each time.
When there were 10,000 records everything worked faster, of course, but now there is a problem with speed, and not only speed.
My aggregation:
import { concat } from 'lodash';
...
let query = [{$match: {}}];
query = concat(query, [{$sort : {createdAt: -1}}]);
query = concat(query, [
{$skip : (pageNum - 1) * perPage}, // 0
{$limit : perPage} // 10
]);
return User.aggregate(query)
.collation({locale: 'en', strength: 2})
.then((users) => ...;
2 cases:
the first fetch is very slow
when I click to the last page I get this error:
MongoError: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.
Please tell me: am I building the aggregation incorrectly, or is there a memory problem on the server as the error says (and additional nginx settings are needed; another person is handling that), or is the problem complex, or is it perhaps something else altogether?
Added:
I noticed that the index is not used when sorting, although it should be used?
The aggregation that actually executes (console.log) =>
[
{
"$match": {}
},
{
"$lookup": {
...
}
},
{
"$project": {
...,
createdAt: 1,
...
}
},
{
"$match": {}
},
{
"$sort": {
"createdAt": -1
}
},
{
"$skip": 0
},
{
"$limit": 10
}
]
Thanks for any answers, and sorry for my English :)
It does say that you've hit the memory limit, which makes sense considering that you're trying to sort through 100,000 records. I'd try opting in to external sorting, e.g. return User.aggregate(query).allowDiskUse(true) //etc, and see if that helps your issue.
Whilst this isn't the documentation for the Node.js driver specifically, this link summarises what the allowDiskUse option does (in short, it allows MongoDB to go past the 100MB memory limit and temporarily use your system storage while it performs the query).
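A minimal sketch of both points, assuming the User model maps to a users collection and that the elided $lookup does not change which documents are returned:
// 1. Opt in to external sorting so the $sort can spill to disk
//    instead of aborting at the 100MB in-memory limit.
return User.aggregate(query)
  .allowDiskUse(true)
  .collation({ locale: 'en', strength: 2 })
  .then((users) => users);

// 2. An index matching the sort; note the index can only be used when
//    $sort runs early in the pipeline, before stages like $lookup and $project.
db.users.createIndex({ createdAt: -1 })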

Find after aggregate in MongoDB

{
"_id" : ObjectId("5852725660632d916c8b9a38"),
"response_log" : [
{
"campaignId" : "AA",
"created_at" : ISODate("2016-12-20T11:53:55.727Z")
},
{
"campaignId" : "AB",
"created_at" : ISODate("2016-12-20T11:55:55.727Z")
}]
}
I have documents which contain an array. I want to select all those documents that do not have a response_log.created_at in the last 2 hours from the current time, and whose count of response_log.created_at entries in the last 24 hours is less than 3.
I am unable to figure out how to go about it. Please help.
You can use the aggregation framework to filter the documents. A pipeline with $match and $redact steps will do the filtering.
Consider running the following aggregate operation, where $redact processes the logical condition with the $cond operator and uses the system variables $$KEEP to "keep" the document where the logical condition is true, or $$PRUNE to "remove" the document where the condition is false.
This operation is similar to a $project pipeline that selects the fields in the collection and creates a new field holding the result of the logical condition, followed by a subsequent $match, except that $redact uses a single pipeline stage, which is more efficient:
var moment = require('moment'),
last2hours = moment().subtract(2, 'hours').toDate(),
last24hours = moment().subtract(24, 'hours').toDate();
MongoClient.connect(config.database)
.then(function(db) {
return db.collection('MyCollection')
})
.then(function (collection) {
return collection.aggregate([
{ '$match': { 'response_log.created_at': { '$not': { '$gte': last2hours } } } },
{
'$redact': {
'$cond': [
{
'$lt': [
{
'$size': {
'$filter': {
'input': '$response_log',
'as': 'res',
'cond': {
'$gte': [
'$$res.created_at',
last24hours
]
}
}
}
},
3
]
},
'$$KEEP',
'$$PRUNE'
]
}
}
]).toArray();
})
.then(function(docs) {
console.log(docs)
})
.catch(function(err) {
throw err;
});
Explanations
In the above aggregate operation, if you execute the first $match pipeline step
collection.aggregate([
{ '$match': { 'response_log.created_at': { '$not': { '$gte': last2hours } } } }
])
The documents returned will be the ones that do not have any "response_log.created_at" within the last 2 hours from the current time (the $not negation covers every element of the array), where the variable last2hours is created with the momentjs library using the subtract API.
The subsequent $redact pipeline step will then further filter the documents by using the $cond ternary operator, which evaluates a logical expression that uses $size to get the count and $filter to return only the array elements created within the last 24 hours:
{
'$lt': [
{
'$size': {
'$filter': {
'input': '$response_log',
'as': 'res',
'cond': { '$gte': ['$$res.created_at', last24hours] }
}
}
},
3
]
}
to $$KEEP the document if this condition is true, or $$PRUNE to "remove" the document where the evaluated condition is false.
I know that this is probably not the answer you're looking for, but this may not be the best use case for Mongo. It's easy to do in a relational database, and easy in a database that supports map/reduce, but it is not straightforward in Mongo.
If your data looked different and you kept each log entry as a separate document that references the object (with id 5852725660632d916c8b9a38 in this case) instead of embedding it, then you could make a simple query for the latest log entry with that id. This is what I would do in your case if I were to use Mongo for this (which I wouldn't).
What you can also do is keep a separate collection in Mongo, or add a new property to the object you have here, storing the date of the latest campaign added. Then it would be very easy to search for what you need.
When you are working with a database like Mongo, the shape of your data must reflect what you need to do with it, as in this case. Adding a last-campaign date and updating it on every campaign added would let you search for the campaigns you need very easily; a sketch of this follows.
If you want to be able to run arbitrary searches and aggregations, you may be better off using a relational database.
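A sketch of that last suggestion (the lastCampaignAt field name is my own, and the pushed entry is hypothetical):
// Maintain the latest-campaign date whenever a log entry is added.
db.collection('MyCollection').updateOne(
  { _id: ObjectId("5852725660632d916c8b9a38") },
  {
    $push: { response_log: { campaignId: "AC", created_at: new Date() } },
    $set: { lastCampaignAt: new Date() }
  }
);

// "No campaign in the last 2 hours" then becomes a simple, indexable query.
db.collection('MyCollection').find({ lastCampaignAt: { $lt: last2hours } });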

Combine multiple query with one single $in query and specify limit for each array field?

I am using mongoose with node.js for MongoDB. I need to make 20 parallel find queries in my database with a limit of 4 documents each, as shown below; only the brand_id changes for each brand.
areamodel.find({ brand_id: brand_id }, { '_id': 1 }, { limit: 4 }, function(err, docs) {
  if (err) {
    console.log(err);
  } else {
    console.log('fetched');
  }
});
Now, to run all these queries in parallel, I thought about putting all 20 brand_ids in an array of strings and then using an $in query to get the results, but I don't know how to specify the limit of 4 for every matched element of the array.
I wrote the code below with aggregation, but I don't know where to specify the limit for each element of my array.
var brand_ids = ["brandid1", "brandid2", "brandid3", "brandid4", "brandid5", "brandid6", "brandid7", "brandid8", "brandid9", "brandid10", "brandid11", "brandid12", "brandid13", "brandid14", "brandid15", "brandid16", "brandid17", "brandid18", "brandid19", "brandid20"];
areamodel.aggregate(
  [
    { $match: { brand_id: { $in: brand_ids } } },
    { $project: { _id: 1 } }
  ],
  function(err, docs) {
    if (err) {
      console.error(err);
    } else {
    }
  }
);
Can anyone please tell me how I can solve my problem using only one query?
UPDATE: why I don't think $group will be helpful for me.
Suppose my brand_ids array contains these strings:
brand_ids = ["id1", "id2", "id3", "id4", "id5"]
and my database has the documents below:
{
"brand_id": "id1",
"name": "Levis",
"loc": "india"
},
{
"brand_id": "id1",
"name": "Levis"
"loc": "america"
},
{
"brand_id": "id2",
"name": "Lee"
"loc": "india"
},
{
"brand_id": "id2",
"name": "Lee"
"loc": "america"
}
Desired JSON output
{
"name": "Levis"
},
{
"name": "Lee"
}
For the above example, suppose I have 25,000 documents with "name" as "Levis" and 25,000 documents where "name" is "Lee"; if I use $group, then all 50,000 documents will be scanned and grouped by "name".
But according to the solution I want, once the first document with "Levis" and the first with "Lee" are found, I shouldn't have to look at the remaining thousands of documents.
UPDATE: I think if anyone can tell me the following, I can probably get to my solution.
Consider a case where I have 1,000 total documents in my mongoDB; suppose that out of those 1,000, 100 will pass my match query.
Now, if I apply a limit of 4 to this query, will it take the same time to execute as the query without any limit, or not?
Why I am thinking about this case:
Because if my query takes the same time, then I don't think $group will increase my time, as all documents will be queried anyway.
But if the limit query takes less time than the query without the limit, then:
If I can apply a limit of 4 on each array element, my question is solved.
If I cannot apply a limit on each array element, then I don't think $group will be useful, as in that case I would have to scan all the documents to get the results.
FINAL UPDATE: As I read in the answer below and in the MongoDB docs, using $limit does not change the time taken by the query; it is the network bandwidth that is saved. So I think if anyone can tell me how to apply a limit on array fields (using $group or anything else), my problem will be solved.
mongodb: will limit() increase query speed?
Solution
Actually, my thinking about mongoDB was very wrong: I thought adding a limit to queries decreases the time taken by the query, but that is not the case, which is why I stumbled for so many days before trying the answer that Gregory NEUT and JohnnyHK gave me. Thanks a lot, both of you; I would have found the solution on day one if I had known about this. I really appreciate the help.
I propose you use the $group aggregation stage to group all the data you got from the $match by brand_id, and then limit each group of data using $slice.
Look at this stack overflow post:
db.collection.aggregate([
  {
    $sort: {
      created: -1
    }
  },
  {
    $group: {
      _id: '$city',
      title: {
        $push: '$title'
      }
    }
  },
  {
    $project: {
      _id: 0,
      city: '$_id',
      mostRecentTitle: {
        $slice: ['$title', 0, 2]
      }
    }
  }
])
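Adapted to the collection in this question (a sketch; it reuses the brand_ids array from the question and caps each group at 4 with $slice):
areamodel.aggregate([
  { $match: { brand_id: { $in: brand_ids } } },
  { $group: { _id: '$brand_id', names: { $push: '$name' } } },
  { $project: { _id: 0, brand_id: '$_id', names: { $slice: ['$names', 0, 4] } } }
], function(err, docs) {
  if (err) {
    console.error(err);
  } else {
    console.log(docs);
  }
});
Note that $group still reads every matching document before $slice trims each array, which is consistent with the finding above that $limit saves bandwidth rather than query time.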
I propose using distinct, since that will return all the different brand names in your collection. (I assume this is what you are trying to achieve?)
db.runCommand ( { distinct: "areamodel", key: "name" } )
MongoDB docs
In mongoose I think it is: areamodel.db.db.command({ distinct: "areamodel", key: "name" }) (untested)
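Mongoose also exposes this directly on the model, which avoids dropping down to a raw command (Model.distinct is part of the mongoose API):
areamodel.distinct('name', function(err, names) {
  if (err) {
    console.error(err);
  } else {
    console.log(names); // all distinct brand names
  }
});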
