Aggregate and group the mongodb data [duplicate] - node.js

For example, I have these documents:
{
"addr": "address1",
"book": "book1"
},
{
"addr": "address2",
"book": "book1"
},
{
"addr": "address1",
"book": "book5"
},
{
"addr": "address3",
"book": "book9"
},
{
"addr": "address2",
"book": "book5"
},
{
"addr": "address2",
"book": "book1"
},
{
"addr": "address1",
"book": "book1"
},
{
"addr": "address15",
"book": "book1"
},
{
"addr": "address9",
"book": "book99"
},
{
"addr": "address90",
"book": "book33"
},
{
"addr": "address4",
"book": "book3"
},
{
"addr": "address5",
"book": "book1"
},
{
"addr": "address77",
"book": "book11"
},
{
"addr": "address1",
"book": "book1"
}
and so on. How can I make a request which will describe the top N addresses and the top M books per address?
Example of expected result:
address1 | book_1: 5 | book_2: 10 | book_3: 50 | total: 65
______________________
address2 | book_1: 10 | book_2: 10 | ... | book_M: 10 | total: M*10
...
______________________
addressN | book_1: 20 | book_2: 20 | ... | book_M: 20 | total: M*20

TLDR Summary
In modern MongoDB releases you can brute force this with $slice just off the basic aggregation result. For "large" results, run parallel queries instead for each grouping (a demonstration listing is at the end of the answer), or wait for SERVER-9377 to resolve, which would allow a "limit" on the number of items to $push into an array.
db.books.aggregate([
{ "$group": {
"_id": {
"addr": "$addr",
"book": "$book"
},
"bookCount": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.addr",
"books": {
"$push": {
"book": "$_id.book",
"count": "$bookCount"
},
},
"count": { "$sum": "$bookCount" }
}},
{ "$sort": { "count": -1 } },
{ "$limit": 2 },
{ "$project": {
"books": { "$slice": [ "$books", 2 ] },
"count": 1
}}
])
MongoDB 3.6 Preview
Still not resolving SERVER-9377, but in this release $lookup allows a new "non-correlated" option which takes a "pipeline" expression as an argument instead of the "localField" and "foreignField" options. This then allows a "self-join" with another pipeline expression, in which we can apply $limit in order to return the "top-n" results.
db.books.aggregate([
{ "$group": {
"_id": "$addr",
"count": { "$sum": 1 }
}},
{ "$sort": { "count": -1 } },
{ "$limit": 2 },
{ "$lookup": {
"from": "books",
"let": {
"addr": "$_id"
},
"pipeline": [
{ "$match": {
"$expr": { "$eq": [ "$addr", "$$addr"] }
}},
{ "$group": {
"_id": "$book",
"count": { "$sum": 1 }
}},
{ "$sort": { "count": -1 } },
{ "$limit": 2 }
],
"as": "books"
}}
])
The other addition here is of course the ability to interpolate the variable through $expr using $match to select the matching items in the "join", but the general premise is a "pipeline within a pipeline" where the inner content can be filtered by matches from the parent. Since they are both "pipelines" themselves we can $limit each result separately.
This would be the next best option to running parallel queries, and it would actually be better if the $match were allowed and able to use an index in the "sub-pipeline" processing. So whilst this does not use the "limit to $push" that the referenced issue asks for, it actually delivers something that should work better.
Original Content
You seem to have stumbled upon the top "N" problem. In a way your problem is fairly easy to solve, though not with the exact limiting that you ask for:
db.books.aggregate([
{ "$group": {
"_id": {
"addr": "$addr",
"book": "$book"
},
"bookCount": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.addr",
"books": {
"$push": {
"book": "$_id.book",
"count": "$bookCount"
},
},
"count": { "$sum": "$bookCount" }
}},
{ "$sort": { "count": -1 } },
{ "$limit": 2 }
])
Now that will give you a result like this:
{
"result" : [
{
"_id" : "address1",
"books" : [
{
"book" : "book4",
"count" : 1
},
{
"book" : "book5",
"count" : 1
},
{
"book" : "book1",
"count" : 3
}
],
"count" : 5
},
{
"_id" : "address2",
"books" : [
{
"book" : "book5",
"count" : 1
},
{
"book" : "book1",
"count" : 2
}
],
"count" : 3
}
],
"ok" : 1
}
So this differs from what you are asking in that, while we do get the top results for the address values the underlying "books" selection is not limited to only a required amount of results.
This turns out to be very difficult to do, but it can be done, though the complexity increases with the number of items you need to match. To keep it simple we can keep this at 2 matches at most:
db.books.aggregate([
{ "$group": {
"_id": {
"addr": "$addr",
"book": "$book"
},
"bookCount": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.addr",
"books": {
"$push": {
"book": "$_id.book",
"count": "$bookCount"
},
},
"count": { "$sum": "$bookCount" }
}},
{ "$sort": { "count": -1 } },
{ "$limit": 2 },
{ "$unwind": "$books" },
{ "$sort": { "count": 1, "books.count": -1 } },
{ "$group": {
"_id": "$_id",
"books": { "$push": "$books" },
"count": { "$first": "$count" }
}},
{ "$project": {
"_id": {
"_id": "$_id",
"books": "$books",
"count": "$count"
},
"newBooks": "$books"
}},
{ "$unwind": "$newBooks" },
{ "$group": {
"_id": "$_id",
"num1": { "$first": "$newBooks" }
}},
{ "$project": {
"_id": "$_id",
"newBooks": "$_id.books",
"num1": 1
}},
{ "$unwind": "$newBooks" },
{ "$project": {
"_id": "$_id",
"num1": 1,
"newBooks": 1,
"seen": { "$eq": [
"$num1",
"$newBooks"
]}
}},
{ "$match": { "seen": false } },
{ "$group":{
"_id": "$_id._id",
"num1": { "$first": "$num1" },
"num2": { "$first": "$newBooks" },
"count": { "$first": "$_id.count" }
}},
{ "$project": {
"num1": 1,
"num2": 1,
"count": 1,
"type": { "$cond": [ 1, [true,false],0 ] }
}},
{ "$unwind": "$type" },
{ "$project": {
"books": { "$cond": [
"$type",
"$num1",
"$num2"
]},
"count": 1
}},
{ "$group": {
"_id": "$_id",
"count": { "$first": "$count" },
"books": { "$push": "$books" }
}},
{ "$sort": { "count": -1 } }
])
So that will actually give you the top 2 "books" from the top two "address" entries.
But for my money, stay with the first form and then simply "slice" the elements of the array that are returned to take the first "N" elements.
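If you prefer to do that trimming in application code rather than in the pipeline, a minimal sketch with the Node.js driver (assuming a connected books collection handle, an enclosing async function, and an illustrative M of 2) could look like this:

// Run the basic two-stage aggregation, then keep only the top M books per address client-side
const M = 2;
const results = await books.aggregate([
  { "$group": {
    "_id": { "addr": "$addr", "book": "$book" },
    "bookCount": { "$sum": 1 }
  }},
  { "$group": {
    "_id": "$_id.addr",
    "books": { "$push": { "book": "$_id.book", "count": "$bookCount" } },
    "count": { "$sum": "$bookCount" }
  }},
  { "$sort": { "count": -1 } },
  { "$limit": 2 }
]).toArray();

// Sort each "books" array by count and take the first M entries
const trimmed = results.map(doc => ({
  ...doc,
  books: [...doc.books].sort((a, b) => b.count - a.count).slice(0, M)
}));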
Demonstration Code
The demonstration code is appropriate for usage with current LTS versions of NodeJS from the v8.x and v10.x releases. That's mostly for the async/await syntax, but there is nothing within the general flow that has any such restriction, and it adapts with little alteration to plain promises or even back to a plain callback implementation.
index.js
const { MongoClient } = require('mongodb');
const fs = require('mz/fs');
const uri = 'mongodb://localhost:27017';
const log = data => console.log(JSON.stringify(data, undefined, 2));
(async function() {
try {
const client = await MongoClient.connect(uri);
const db = client.db('bookDemo');
const books = db.collection('books');
let { version } = await db.command({ buildInfo: 1 });
version = parseFloat(version.match(new RegExp(/(?:(?!-).)*/))[0]);
// Clear and load books
await books.deleteMany({});
await books.insertMany(
(await fs.readFile('books.json'))
.toString()
.replace(/\n$/,"")
.split("\n")
.map(JSON.parse)
);
if ( version >= 3.6 ) {
// Non-correlated pipeline with limits
let result = await books.aggregate([
{ "$group": {
"_id": "$addr",
"count": { "$sum": 1 }
}},
{ "$sort": { "count": -1 } },
{ "$limit": 2 },
{ "$lookup": {
"from": "books",
"as": "books",
"let": { "addr": "$_id" },
"pipeline": [
{ "$match": {
"$expr": { "$eq": [ "$addr", "$$addr" ] }
}},
{ "$group": {
"_id": "$book",
"count": { "$sum": 1 },
}},
{ "$sort": { "count": -1 } },
{ "$limit": 2 }
]
}}
]).toArray();
log({ result });
}
// Serial result processing with parallel fetch
// First get top addr items
let topaddr = await books.aggregate([
{ "$group": {
"_id": "$addr",
"count": { "$sum": 1 }
}},
{ "$sort": { "count": -1 } },
{ "$limit": 2 }
]).toArray();
// Run parallel top books for each addr
let topbooks = await Promise.all(
topaddr.map(({ _id: addr }) =>
books.aggregate([
{ "$match": { addr } },
{ "$group": {
"_id": "$book",
"count": { "$sum": 1 }
}},
{ "$sort": { "count": -1 } },
{ "$limit": 2 }
]).toArray()
)
);
// Merge output
topaddr = topaddr.map((d,i) => ({ ...d, books: topbooks[i] }));
log({ topaddr });
client.close();
} catch(e) {
console.error(e)
} finally {
process.exit()
}
})()
books.json
{ "addr": "address1", "book": "book1" }
{ "addr": "address2", "book": "book1" }
{ "addr": "address1", "book": "book5" }
{ "addr": "address3", "book": "book9" }
{ "addr": "address2", "book": "book5" }
{ "addr": "address2", "book": "book1" }
{ "addr": "address1", "book": "book1" }
{ "addr": "address15", "book": "book1" }
{ "addr": "address9", "book": "book99" }
{ "addr": "address90", "book": "book33" }
{ "addr": "address4", "book": "book3" }
{ "addr": "address5", "book": "book1" }
{ "addr": "address77", "book": "book11" }
{ "addr": "address1", "book": "book1" }

Using an aggregation pipeline like the one below:
[
{$group: {_id : {book : '$book',address:'$addr'}, total:{$sum :1}}},
{$project : {book : '$_id.book', address : '$_id.address', total : '$total', _id : 0}}
]
it will give you a result like the following:
{
"total" : 1,
"book" : "book33",
"address" : "address90"
},
{
"total" : 1,
"book" : "book5",
"address" : "address1"
},
{
"total" : 1,
"book" : "book99",
"address" : "address9"
},
{
"total" : 1,
"book" : "book1",
"address" : "address5"
},
{
"total" : 1,
"book" : "book5",
"address" : "address2"
},
{
"total" : 1,
"book" : "book3",
"address" : "address4"
},
{
"total" : 1,
"book" : "book11",
"address" : "address77"
},
{
"total" : 1,
"book" : "book9",
"address" : "address3"
},
{
"total" : 1,
"book" : "book1",
"address" : "address15"
},
{
"total" : 2,
"book" : "book1",
"address" : "address2"
},
{
"total" : 3,
"book" : "book1",
"address" : "address1"
}
I didn't quite get your expected result format, so feel free to modify this to the one you need.
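If you do want the books nested under each address with a grand total, a hedged sketch that just adds another $group (and a final $project; the field names are only illustrative) on top of the same pipeline would be:

[
{$group: {_id: {book: '$book', address: '$addr'}, total: {$sum: 1}}},
{$group: {
  _id: '$_id.address',
  books: {$push: {book: '$_id.book', total: '$total'}},
  total: {$sum: '$total'}
}},
{$project: {address: '$_id', books: 1, total: 1, _id: 0}}
]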

The query below will produce a result in the same shape as the desired response:
db.books.aggregate([
{
$group: {
_id: { addresses: "$addr", books: "$book" },
num: { $sum :1 }
}
},
{
$group: {
_id: "$_id.addresses",
bookCounts: { $push: { bookName: "$_id.books",count: "$num" } }
}
},
{
$project: {
_id: 1,
bookCounts:1,
"totalBookAtAddress": {
"$sum": "$bookCounts.count"
}
}
}
])
The response will look like the following:
/* 1 */
{
"_id" : "address4",
"bookCounts" : [
{
"bookName" : "book3",
"count" : 1
}
],
"totalBookAtAddress" : 1
},
/* 2 */
{
"_id" : "address90",
"bookCounts" : [
{
"bookName" : "book33",
"count" : 1
}
],
"totalBookAtAddress" : 1
},
/* 3 */
{
"_id" : "address15",
"bookCounts" : [
{
"bookName" : "book1",
"count" : 1
}
],
"totalBookAtAddress" : 1
},
/* 4 */
{
"_id" : "address3",
"bookCounts" : [
{
"bookName" : "book9",
"count" : 1
}
],
"totalBookAtAddress" : 1
},
/* 5 */
{
"_id" : "address5",
"bookCounts" : [
{
"bookName" : "book1",
"count" : 1
}
],
"totalBookAtAddress" : 1
},
/* 6 */
{
"_id" : "address1",
"bookCounts" : [
{
"bookName" : "book1",
"count" : 3
},
{
"bookName" : "book5",
"count" : 1
}
],
"totalBookAtAddress" : 4
},
/* 7 */
{
"_id" : "address2",
"bookCounts" : [
{
"bookName" : "book1",
"count" : 2
},
{
"bookName" : "book5",
"count" : 1
}
],
"totalBookAtAddress" : 3
},
/* 8 */
{
"_id" : "address77",
"bookCounts" : [
{
"bookName" : "book11",
"count" : 1
}
],
"totalBookAtAddress" : 1
},
/* 9 */
{
"_id" : "address9",
"bookCounts" : [
{
"bookName" : "book99",
"count" : 1
}
],
"totalBookAtAddress" : 1
}

Since MongoDB version 3.6 this is easy to do, using $group, $slice, $limit, and $sort:
$group the books to count them
$sort so they will be later pushed according to count
$group by address, $push relevant books, and $sum the total per address.
$sort by address total
$limit the address results to topN
Limit the books in the array to topM using $slice
db.collection.aggregate([
{$group: {_id: {book: "$book", addr: "$addr"}, count: {$sum: 1}}},
{$sort: {"_id.addr": 1, count: -1}},
{$group: {
_id: "$_id.addr", totalCount: {$sum: "$count"},
books: {$push: {book: "$_id.book", count: "$count"}}
}
},
{$sort: {totalCount: -1}},
{$limit: topN},
{$set: {addr: "$_id", _id: "$$REMOVE", books: {$slice: ["$books", 0, topM]}}}
])
See how it works on the playground example-v3.4
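topN and topM in the pipeline above are placeholders for concrete numbers, so define them before running it, for example:

// Keep the 2 most frequent addresses and the 2 most frequent books per address
var topN = 2;
var topM = 2;

Note that the $set stage requires MongoDB 4.2 or newer; on older versions $addFields behaves identically.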
On MongoDB version 5.2 there is a $topN accumulator that can simplify this even more:
db.collection.aggregate([
{$group: {_id: {book: "$book", addr: "$addr"}, count: {$sum: 1}}},
{$group: {
_id: "$_id.addr",
totalCount: {$sum: "$count"},
books: {$topN: {output: {book: "$_id.book", count: "$count"},
sortBy: {count: -1},
n: topM
}}
}},
{$sort: {totalCount: -1}},
{$limit: topN},
{$project: {addr: "$_id", _id: 0, books: 1, totalCount: 1}}
])
See how it works on the playground example-v5.2

Related

Mongoose aggregate to get Average rating, count each rate and return the actual ratings

I'm trying to get the average rating for a product, plus the count of each rating, and also return the actual ratings, using pagination to limit the amount returned without affecting the average or the counts.
So I'm trying to achieve something like this:
This is my rating collection:
{
"productId": "3",
"userid": 5,
"rating": 5,
"comment": "this is nice"
},
{
"productId": "3",
"userid": 2,
"rating": 4,
"comment": "this is very nice"
}
and this is the end result I want
{
"_id" : 1,
"avgRating": "3.6"
"counts" : [
{
"rating" : 5,
"count" : 8
},
{
"rating" : 3,
"count" : 2
},
{
"rating" : 4,
"count" : 4
},
{
"rating" : 1,
"count" : 4
}
],
"ratings": [
{
"productId": "3"
"userid" : 5,
"rating" : 5
"comment": "this is nice"
},
{
"productId": "3"
"userid" : 2,
"rating" :4
"comment": "this is very nice"
},
{
"productId": "3"
"userid" : 12,
"rating" : 4
"comment": "this is okay"
}
]
}
I have this so far, which gives me the count for each rating:
db.votes.aggregate([
{ $match: { postId: {$in: [1,2]} } },
{
$group: { _id: { post: "$postId", rating: "$vote" }, count: { $sum: 1 } }
},
{
$group: {
_id: "$_id.post",
counts: { $push: { rating: "$_id.rating", count: "$count" } }
}
}
])
You're not far off, we just have to adjust some things:
db.votes.aggregate([
{
$match:
{
postId: {$in: [1, 2]}
}
},
{
$group: {
_id: {post: "$postId", rating: "$vote"},
count: {$sum: 1},
reviews: {$push : "$$ROOT" } //keep the original document
}
},
{
$group: {
_id: "$_id.post",
counts: {$push: {rating: "$_id.rating", count: "$count"}},
reviews: {$push: "$reviews"},
totalItemCount: {$sum: "$count"}, //for avg calculation
totalRating: {$sum: {$multiply: ["$_id.rating", "$count"]}} // weight each rating by its count, for avg calculation
}
},
{
$project: {
_id: "$_id",
avgRating: {$divide: ["$totalRating", "$totalItemCount"]},
counts: "$counts",
reviews: {
$slice: [
{
$reduce: {
input: "$reviews",
initialValue: [],
in: { $concatArrays: ["$$value", "$$this"] }
}
},
0, //skip
10 //limit
]
}
}
}
])
Note that I preserved the current pipeline structure for clarity; however, I feel that a pipeline that utilizes $facet might be more efficient, as we won't have to hold the entire collection in memory while grouping.
We'll split it into two: one part is the current pipeline minus the review section, and the other has just the $skip and $limit stages.
EDIT:
$facet version:
db.votes.aggregate([
{
"$match": {
"postId": {"$in": [1, 2]}
}
},
{
"$facet": {
"numbers": [
{
"$group": {
"_id": {
"post": "$postId",
"rating": "$vote"
},
"count": {
"$sum": 1.0
}
}
},
{
"$group": {
"_id": "$_id.post",
"counts": {
"$push": {
"rating": "$_id.rating",
"count": "$count"
}
},
"totalItemCount": {
"$sum": "$count"
},
"totalRating": {
"$sum": "$_id.rating"
}
}
}
],
"reviews": [
{
"$skip": 0.0
},
{
"$limit": 10.0
}
]
}
},
{
"$unwind": "$numbers"
},
{
"$project": {
"_id": "$numbers._id",
"reviews": "$reviews",
"avgRating": {"$divide": ["$numbers.totalRating", "$numbers.totalItemCount"]},
"counts": "$numbers.counts"
}
}
]);

How to distinct (count) value when value is nested in array?

I have a data sample something like this:
"diagnostics" : {
"_ID" : "554bbf7b761e06f02fef3561",
"tests" : [
{
"_id" : "59d678064e4645ec562a37e2",
"name" : "RBC",
},
{
"_id" : "59d678064e4645ec562a37e1",
"name" : "Calcium",
}
]
}
I want to get all distinct _ID values and the count of each test group with their names,
which is something like this:
"_ID" : "554bbf7b761e06f02fef3561"{ {"name" : "Calcium", count :(count of Calcium)},{"name" : "RBC", count :(count of RBC)}
The thing to keep in mind is that tests is inside diagnostics and can contain the name field any number of times (once, twice, or more), and I want an individual count of each distinct name.
db.collection('transactions').aggregate([
{ $unwind : '$diagnostics.tests' },
{ $group : {
_id: {
"Test_Name" : '$diagnostics.tests.name',
"ID" : '$diagnostics._id'
},
test_count: { $sum: 1 }
}
}
])
and I am getting a result something like this:
[
{
"_id": {
"Test_Name": "Fasting Blood Sugar",
"ID": "554bbf7b761e06f02fef3561"
},
"test_count": 76
},
{
"_id": {
"Test_Name": "Fasting Blood Sugar",
"ID": "566726c35dc18d13242fffcc"
},
"test_count": 1
},
{
"_id": {
"Test_Name": "CBC - 7 Part",
"ID": "566726c35dc18d13242fffcc"
},
"test_count": 1
},
{
"_id": {
"Test_Name": "RBC",
"ID": "554bbf7b761e06f02fef3561"
},
"test_count": 1
},
{
"_id": {
"Test_Name": "Fasting Blood Sugar",
"ID": "5a2c9edfe0d0ec71aef1e526"
},
"test_count": 6
},
{
"_id": {
"Test_Name": "Calcium",
"ID": "554bbf7b761e06f02fef3561"
},
"test_count": 77
}
]
Can anybody help me with the query?
You need to use multiple $group stages here.
First $unwind the tests and $group by the test "name", then $group by the diagnostics "_ID" and reshape it back to the original form, and lastly for the tests count you can check the $size of the "tests" array.
db.collection.aggregate([
{ "$unwind": "$diagnostics.tests" },
{ "$group": {
"_id": {
"_id": "$diagnostics.tests.name",
"diagnosticID": "$diagnostics._ID"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": {
"_ID": "$_id.diagnosticID"
},
"tests": {
"$push": {
"name": "$_id._id",
"count": "$count"
}
}
}},
{ "$project": {
"diagnostics._ID": "$_id._ID",
"diagnostics.tests": "$tests",
"_id": 0,
"testCount": { "$size": "$tests" }
}}
])

how get count from mongodb with different status from one collection

I have an appointment collection in which I have status codes like upcoming, cancelled, and completed. I want to write an API to get the count of each status using Mongoose or MongoDB methods.
The output should be like below:
[{
group: "grp1",
appointments_completed: 4,
appointments_upcoming: 5,
appointments_cancelled: 7
}]
thanks in advance.
I hope it helps you:
db.getCollection('codelist').aggregate([
{
$group:{
_id:{status:"$status"},
count:{$sum:1}
}
}
])
The result will be
[{
"_id" : {
"status" : "cancelled"
},
"count" : 13.0
},
{
"_id" : {
"status" : "completed"
},
"count" : 20.0
}
]
I think you can then process it with Node.js, for example:
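A rough sketch of that post-processing with the Node.js driver (the codelist collection name comes from the query above; the 'grp1' label is just a placeholder) might be:

// Group by status in MongoDB, then fold the results into one summary object in Node.js
const statusCounts = await db.collection('codelist').aggregate([
  { $group: { _id: { status: '$status' }, count: { $sum: 1 } } }
]).toArray();

const summary = statusCounts.reduce(
  (acc, { _id: { status }, count }) => ({ ...acc, ['appointments_' + status]: count }),
  { group: 'grp1' } // placeholder group label
);
// e.g. { group: 'grp1', appointments_cancelled: 13, appointments_completed: 20 }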
Using the aggregation pipeline's $group stage we can get this count:
db.collection_name.aggregate([
{ $group: {
_id:null,
appointments_completed: {$sum : "$appointments_completed" },
appointments_upcoming:{$sum :"$appointments_upcoming"},
appointments_cancelled:{$sum: "$appointments_cancelled"}
}
}
]);
With MongoDB 3.6 and newer, you can leverage the $arrayToObject operator and a $replaceRoot pipeline stage to get the desired result. You would need to run the following aggregate pipeline:
db.appointments.aggregate([
{ "$group": {
"_id": {
"group": <group_by_field>,
"status": { "$concat": ["appointments_", { "$toLower": "$status" }] }
},
"count": { "$sum": 1 }
} },
{ "$group": {
"_id": "$_id.group",
"counts": {
"$push": {
"k": "$_id.status",
"v": "$count"
}
}
} },
{ "$addFields": {
"counts": {
"$setUnion": [
"$counts", [
{
"k": "group",
"v": "$_id"
}
]
]
}
} },
{ "$replaceRoot": {
"newRoot": { "$arrayToObject": "$counts" }
} }
])
For older versions, a more generic approach, though with a different output format, would be to group twice and get the counts as an array of key/value objects, as in the following:
db.appointments.aggregate([
{ "$group": {
"_id": {
"group": <group_by_field>,
"status": { "$toLower": "$status" }
},
"count": { "$sum": 1 }
} },
{ "$group": {
"_id": "$_id.group",
"counts": {
"$push": {
"status": "$_id.status",
"count": "$count"
}
}
} }
])
which spits out:
{
"_id": "grp1",
"counts": [
{ "status": "completed", "count": 4 },
{ "status": "upcoming", "count": 5 },
{ "status": "cancelled", "count": 7 }
]
}
If the status codes are fixed then the $cond operator in the $group pipeline step can be used effectively to evaluate the counts based on the status field value. Your overall aggregation pipeline can be constructed as follows to produce the result in the desired format:
db.appointments.aggregate([
{ "$group": {
"_id": <group_by_field>,
"appointments_completed": {
"$sum": {
"$cond": [ { "$eq": [ "$status", "completed" ] }, 1, 0 ]
}
},
"appointments_upcoming": {
"$sum": {
"$cond": [ { "$eq": [ "$status", "upcoming" ] }, 1, 0 ]
}
},
"appointments_cancelled": {
"$sum": {
"$cond": [ { "$eq": [ "$status", "cancelled" ] }, 1, 0 ]
}
}
} }
])

recover last value list of variables MongoDB [duplicate]


Partition of Data with MongoDB

I have following collection
[
{
"setting": "Volume",
"_id": ObjectId("5a934e000102030405000000"),
"counting": 1
},
{
"setting": "Brightness",
"_id": ObjectId("5a934e000102030405000001"),
"counting": 1
},
{
"setting": "Contrast",
"_id": ObjectId("5a934e000102030405000002"),
"counting": 1
},
{
"setting": "Contrast",
"_id": ObjectId("5a934e000102030405000003"),
"counting": 1
},
{
"setting": "Contrast",
"_id": ObjectId("5a934e000102030405000004"),
"counting": 0
},
{
"setting": "Sharpness",
"_id": ObjectId("5a934e000102030405000005"),
"counting": 1
},
{
"setting": "Sharpness",
"_id": ObjectId("5a934e000102030405000006"),
"counting": 1
},
{
"setting": "Language",
"_id": ObjectId("5a934e000102030405000007"),
"counting": 1
},
{
"setting": "Language",
"_id": ObjectId("5a934e000102030405000008"),
"counting": 0
}
]
Now I want to group by setting and keep only the top two entries in the result; the rest should be combined into "Useless".
So, after sorting by counting, my output should be:
[
{
"setting": "Contrast",
"counting": 2
},
{
"setting": "Sharpness",
"counting": 2
},
{
"setting": "Useless",
"counting": 3
}
]
If you can get away with it, then it's probably best to "stuff" the reduced results into a single document and then $slice the top two and $sum the rest:
Model.aggregate([
{ "$group": {
"_id": "$setting",
"counting": { "$sum": "$counting" }
}},
{ "$sort": { "counting": -1 } },
{ "$group": {
"_id": null,
"data": { "$push": "$$ROOT" }
}},
{ "$addFields": {
"data": {
"$let": {
"vars": { "top": { "$slice": ["$data", 0, 2 ] } },
"in": {
"$concatArrays": [
"$$top",
{ "$cond": {
"if": { "$gt": [{ "$size": "$data" }, 2] },
"then":
[{
"_id": "Useless",
"counting": {
"$sum": {
"$map": {
"input": {
"$filter": {
"input": "$data",
"cond": { "$not": { "$in": [ "$$this._id", "$$top._id" ] } }
}
},
"in": "$$this.counting"
}
}
}
}],
"else": []
}}
]
}
}
}
}},
{ "$unwind": "$data" },
{ "$replaceRoot": { "newRoot": "$data" } }
])
If it's potentially a very "large" result even when reduced, then use a $facet, with $limit for the top entries and a separate pipeline for the "rest":
Model.aggregate([
{ "$facet": {
"top": [
{ "$group": {
"_id": "$setting",
"counting": { "$sum": "$counting" }
}},
{ "$sort": { "counting": -1 } },
{ "$limit": 2 }
],
"rest": [
{ "$group": {
"_id": "$setting",
"counting": { "$sum": "$counting" }
}},
{ "$sort": { "counting": -1 } },
{ "$skip": 2 },
{ "$group": {
"_id": "Useless",
"counting": { "$sum": "$counting" }
}}
]
}},
{ "$project": {
"data": {
"$concatArrays": [
"$top","$rest"
]
}
}},
{ "$unwind": "$data" },
{ "$replaceRoot": { "newRoot": "$data" } }
])
Or even $lookup with MongoDB 3.6:
Model.aggregate([
{ "$group": {
"_id": "$setting",
"counting": { "$sum": "$counting" }
}},
{ "$sort": { "counting": -1 } },
{ "$limit": 2 },
{ "$group": {
"_id": null,
"top": { "$push": "$$ROOT" }
}},
{ "$lookup": {
"from": "colllection",
"let": { "settings": "$top._id" },
"pipeline": [
{ "$match": {
"$expr": {
"$not": { "$in": [ "$setting", "$$settings" ] }
}
}},
{ "$group": {
"_id": "Useless",
"counting": { "$sum": "$counting" }
}}
],
"as": "rest"
}},
{ "$project": {
"data": {
"$concatArrays": [ "$top", "$rest" ]
}
}},
{ "$unwind": "$data" },
{ "$replaceRoot": { "newRoot": "$data" } }
])
All pretty much the same really, and all return the same result:
{ "_id" : "Contrast", "counting" : 2 }
{ "_id" : "Sharpness", "counting" : 2 }
{ "_id" : "Useless", "counting" : 3 }
Optionally, $project right at the end of each instead of the $replaceRoot if control over the field names is really important to you. Generally I just stick with the $group defaults.
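For instance, a sketch of that alternative ending (the output field names here are only illustrative) would swap the final $replaceRoot for a $project:

// ...same stages as above, up to and including { "$unwind": "$data" }, then:
{ "$project": {
  "_id": 0,
  "setting": "$data._id",
  "counting": "$data.counting"
}}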
In the event that your MongoDB predates 3.4 and the resulting "Useless" remainder is actually too large to use any variant of the first approach, then simple Promise resolution is basically the answer: run one query for the aggregate and another for a basic count, and simply do the math:
let [docs, count] = await Promise.all([
Model.aggregate([
{ "$group": {
"_id": "$setting",
"counting": { "$sum": "$counting" }
}},
{ "$sort": { "counting": -1 } },
{ "$limit": 2 },
]),
Model.count().exec()
]);
docs = [
...docs,
{
"_id": "Useless",
"counting": count - docs.reduce((o,e) => o + e.counting, 0)
}
];
Or without the async/await:
Promise.all([
Model.aggregate([
{ "$group": {
"_id": "$setting",
"counting": { "$sum": "$counting" }
}},
{ "$sort": { "counting": -1 } },
{ "$limit": 2 },
]),
Model.count().exec()
]).then(([docs, count]) => ([
...docs,
{
"_id": "Useless",
"counting": count - docs.reduce((o,e) => o + e.counting, 0)
}
])).then( result => /* do something */ )
This is basically a variation on the age old "total pages" approach of simply running a separate query to count the collection items.
Running separate requests is generally the age old way of doing this and it often performs best. The rest of the solutions are essentially aimed at "aggregation tricks", since that was what you were asking for, and that is what the answer shows with different variations on the same thing.
One variant puts all results into a single document (where possible, due to the BSON limit of course) and the others basically vary on the "age old" approach by running the query again in a different form: $facet in parallel and $lookup in series.
