How to delete duplicate documents in MongoDB after running aggregate - python-3.x

I have the following aggregate that displays all the duplicates in my DB:
db.Articles.aggregate([
{"$group" : { "_id": "$url", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$project": {"url" : "$_id", "_id" : 0} }
]);
My question is, how do I delete the results after running this aggregate?
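One possible way to get from that aggregate to an actual delete is to let the $group stage also collect the _id of every document sharing a url, keep one _id per group, and delete the rest. The following is a minimal sketch in JavaScript rather than a definitive implementation: the names duplicatePipeline and idsToDelete are made up for illustration, and keeping the first _id of each group is an assumption about which copy should survive.

```javascript
// Variant of the pipeline above that also records which documents share a url.
const duplicatePipeline = [
  { $match: { url: { $ne: null } } },
  { $group: { _id: "$url", ids: { $push: "$_id" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
];

// Hypothetical helper: keep the first _id of every group, return all others.
function idsToDelete(groups) {
  return groups.flatMap((group) => group.ids.slice(1));
}

// Example with hand-written group results (plain numbers stand in for ObjectIds).
const sampleGroups = [
  { _id: "http://a.example", ids: [1, 2, 3], count: 3 },
  { _id: "http://b.example", ids: [4, 5], count: 2 },
];
console.log(idsToDelete(sampleGroups)); // [ 2, 3, 5 ]

// Against a real database the result could then be passed to deleteMany:
//   const groups = db.Articles.aggregate(duplicatePipeline).toArray();
//   db.Articles.deleteMany({ _id: { $in: idsToDelete(groups) } });
```

With PyMongo the same two steps would go through collection.aggregate and collection.delete_many.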

Related

Mongodb aggregate $group stage takes a long time

I'm practicing how to use MongoDB aggregation, but my aggregations seem to take a really long time to run.
The problem seems to happen whenever I use $group. All other queries run just fine.
I have some 1.3 million dummy documents on which I need to perform two basic operations: get a count of the IP addresses and a count of the unique IP addresses.
My schema looks something like this:
{
"_id":"5da51af103eb566faee6b8b4",
"ip_address":"...",
"country":"CL",
"browser":{
"user_agent":...",
}
}
Running a basic $group query takes about 12s on average, which is much too slow.
I did a little research, and someone suggested creating an index on ip_address. That seems to have slowed it down, because queries now take 13-15s.
I use MongoDB and the query I'm running looks like this:
visitorsModel.aggregate([
{
'$group': {
'_id': '$ip_address',
'count': {
'$sum': 1
}
}
}
]).allowDiskUse(true)
.exec(function (err, docs) {
if (err) throw err;
return res.send({
uniqueCount: docs.length
})
})
Any help is appreciated.
Edit: I forgot to mention, someone suggested it might be a hardware issue? I'm running the query on a core i5, 8GB RAM laptop if it helps.
Edit 2: The query plan:
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"ip_address" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "metrics.visitors",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1387324,
"executionTimeMillis" : 7671,
"totalKeysExamined" : 0,
"totalDocsExamined" : 1387324,
"executionStages" : {
"stage" : "COLLSCAN",
"nReturned" : 1387324,
"executionTimeMillisEstimate" : 9,
"works" : 1387326,
"advanced" : 1387324,
"needTime" : 1,
"needYield" : 0,
"saveState" : 10930,
"restoreState" : 10930,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 1387324
}
}
}
},
{
"$group" : {
"_id" : "$ip_address",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"ok" : 1
}
Here is some information about the $group aggregation stage: whether it uses indexes, what its limitations are, and what you can try to work around them.
1. The $group Stage Doesn't Use Index:
Mongodb Aggregation: Does $group use index?
2. $group Operator and Memory:
The $group stage has a limit of 100 megabytes of RAM. By default, if
the stage exceeds this limit, $group returns an error. To allow for
the handling of large datasets, set the allowDiskUse option to true.
This flag enables $group operations to write to temporary files.
See the MongoDB docs on $group Operator and Memory.
3. An Example Using $group and Count:
A collection called cities:
{ "_id" : 1, "city" : "Bangalore", "country" : "India" }
{ "_id" : 2, "city" : "New York", "country" : "United States" }
{ "_id" : 3, "city" : "Canberra", "country" : "Australia" }
{ "_id" : 4, "city" : "Hyderabad", "country" : "India" }
{ "_id" : 5, "city" : "Chicago", "country" : "United States" }
{ "_id" : 6, "city" : "Amritsar", "country" : "India" }
{ "_id" : 7, "city" : "Ankara", "country" : "Turkey" }
{ "_id" : 8, "city" : "Sydney", "country" : "Australia" }
{ "_id" : 9, "city" : "Srinagar", "country" : "India" }
{ "_id" : 10, "city" : "San Francisco", "country" : "United States" }
Query the collection to count the cities by each country:
db.cities.aggregate( [
{ $group: { _id: "$country", cityCount: { $sum: 1 } } },
{ $project: { country: "$_id", _id: 0, cityCount: 1 } }
] )
The Result:
{ "cityCount" : 3, "country" : "United States" }
{ "cityCount" : 1, "country" : "Turkey" }
{ "cityCount" : 2, "country" : "Australia" }
{ "cityCount" : 4, "country" : "India" }
4. Using allowDiskUse Option:
db.cities.aggregate( [
{ $group: { _id: "$country", cityCount: { $sum: 1 } } },
{ $project: { country: "$_id", _id: 0, cityCount: 1 } }
], { allowDiskUse : true } )
Note, in this case it makes no difference in query performance or output. This is to show the usage only.
5. Some Options to Try (suggestions):
You can try a few things to get some result (for trial purposes only):
Use a $limit stage to restrict the number of documents processed and
see what the result is. For example, you can try { $limit: 1000 }.
Note that this stage needs to come before the $group stage.
You can also use the $match and $project stages before the $group
stage to control the shape and size of the input. This may
return a result (instead of an error).
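The two suggestions above can be sketched as a small helper that places the cheap stages in front of $group. This is only an illustration in plain JavaScript; buildTrialPipeline and its parameters are hypothetical names, not part of MongoDB or Mongoose.

```javascript
// Hypothetical helper: build a group-by-ip pipeline with optional
// $match / $limit stages placed before the $group stage.
function buildTrialPipeline({ filter = null, limit = null } = {}) {
  const stages = [];
  if (filter) stages.push({ $match: filter });          // shrink the input set
  stages.push({ $project: { ip_address: 1, _id: 0 } }); // keep only the grouped field
  if (limit !== null) stages.push({ $limit: limit });   // cap documents for a trial run
  stages.push({ $group: { _id: "$ip_address", count: { $sum: 1 } } });
  return stages;
}

const trial = buildTrialPipeline({ filter: { country: "CL" }, limit: 1000 });
console.log(JSON.stringify(trial));
```

The resulting array could be passed to aggregate() in place of the hand-written pipeline, for example visitorsModel.aggregate(trial).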
[EDIT ADD]
Notes on Distinct and Count:
Using the same cities collection - to get unique countries and a count of them you can try using the aggregate stage $count along with $group as in the following two queries.
Distinct:
db.cities.aggregate( [
{ $match: { country: { $exists: true } } },
{ $group: { _id: "$country" } },
{ $project: { country: "$_id", _id: 0 } }
] )
The Result:
{ "country" : "United States" }
{ "country" : "Turkey" }
{ "country" : "India" }
{ "country" : "Australia" }
To get the above result as a single document with an array of unique values, use the $addToSet operator:
db.cities.aggregate( [
{ $match: { country: { $exists: true } } },
{ $group: { _id: null, uniqueCountries: { $addToSet: "$country" } } },
{ $project: { _id: 0 } },
] )
The Result: { "uniqueCountries" : [ "United States", "Turkey", "India", "Australia" ] }
Count:
db.cities.aggregate( [
{ $match: { country: { $exists: true } } },
{ $group: { _id: "$country" } },
{ $project: { country: "$_id", _id: 0 } },
{ $count: "uniqueCountryCount" }
] )
The Result: { "uniqueCountryCount" : 4 }
In the above queries the $match stage filters out documents that have no country field (note that { $exists: true } still matches documents where country is explicitly null). The $project stage reshapes the result document(s).
MongoDB Query Language:
Note that you get similar results with the MongoDB query language commands db.cities.distinct("country") and db.cities.distinct("country").length (distinct returns an array).
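As a quick cross-check of the expected numbers, the same distinct-and-count logic can be emulated in plain JavaScript over the sample cities documents. This does not talk to MongoDB; it is only a sketch of what distinct and the $group/$count pipeline compute, and the helper name distinctValues is made up.

```javascript
// The sample cities collection from above, as plain objects.
const cities = [
  { _id: 1, city: "Bangalore", country: "India" },
  { _id: 2, city: "New York", country: "United States" },
  { _id: 3, city: "Canberra", country: "Australia" },
  { _id: 4, city: "Hyderabad", country: "India" },
  { _id: 5, city: "Chicago", country: "United States" },
  { _id: 6, city: "Amritsar", country: "India" },
  { _id: 7, city: "Ankara", country: "Turkey" },
  { _id: 8, city: "Sydney", country: "Australia" },
  { _id: 9, city: "Srinagar", country: "India" },
  { _id: 10, city: "San Francisco", country: "United States" },
];

// Hypothetical helper: collect the unique values of one field,
// skipping documents where the field is missing.
function distinctValues(docs, field) {
  const seen = new Set();
  for (const doc of docs) {
    if (doc[field] !== undefined) seen.add(doc[field]);
  }
  return [...seen];
}

const countries = distinctValues(cities, "country");
console.log(countries);        // [ 'India', 'United States', 'Australia', 'Turkey' ]
console.log(countries.length); // 4
```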
You can create an index:
db.collectionname.createIndex( { ip_address: "text" } )
Try this; it should be faster.
I think it will help you.

how to get data in mongoose where last element in array?

I have data looks like this:
[
{
"_id" : ObjectId("5b56eb3deb869312d85a8e69"),
"transactionStatus" : [
{
"status" : "pending",
"createdAt" : ISODate("2018-07-24T09:02:53.347Z")
},
{
"status" : "process",
"createdAt" : ISODate("2018-07-24T09:02:53.347Z")
}
]
},
{
"_id" : ObjectId("5b56eb3deb869312d8589765"),
"transactionStatus" : [
{
"status" : "pending",
"createdAt" : ISODate("2018-07-24T09:02:53.347Z")
},
{
"status" : "process",
"createdAt" : ISODate("2018-07-24T09:03:30.347Z")
},
{
"status" : "done",
"createdAt" : ISODate("2018-07-24T09:04:22.347Z")
}
]
}
]
I want to get the documents above whose last transactionStatus element has status = process, so the result should be:
{
"_id" : ObjectId("5b56eb3deb869312d85a8e69"),
"transactionStatus" : [
{
"status" : "pending",
"createdAt" : ISODate("2018-07-24T09:02:53.347Z")
},
{
"status" : "process",
"createdAt" : ISODate("2018-07-24T09:02:53.347Z")
}
]
}
How can I do that with Mongoose?
You can use $expr (MongoDB 3.6+) inside $match. Using $let and $arrayElemAt with -1 as the second argument, you can get the last element as a temporary variable and then compare its value:
db.col.aggregate([
{
$match: {
$expr: {
$let: {
vars: { last: { $arrayElemAt: [ "$transactionStatus", -1 ] } },
in: { $eq: [ "$$last.status", "process" ] }
}
}
}
}
])
The same result can be achieved on lower versions of MongoDB (3.4+) using $addFields and $match. You can then add a $project stage to remove the temporary field:
db.col.aggregate([
{
$addFields: {
last: { $arrayElemAt: [ "$transactionStatus", -1 ] }
}
},
{
$match: { "last.status": "process" }
},
{
$project: { last: 0 }
}
])
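Since the question asks about Mongoose, the $expr pipeline can be passed to a model's aggregate() method. The sketch below assumes a Mongoose model for this collection already exists; lastStatusPipeline, findByLastStatus and the Transaction name in the usage comment are hypothetical.

```javascript
// Build the $expr pipeline for an arbitrary status value.
function lastStatusPipeline(status) {
  return [
    {
      $match: {
        $expr: {
          $let: {
            vars: { last: { $arrayElemAt: ["$transactionStatus", -1] } },
            in: { $eq: ["$$last.status", status] },
          },
        },
      },
    },
  ];
}

// Hypothetical helper: `Model` is any Mongoose model; Model.aggregate()
// returns an Aggregate object that can be awaited.
function findByLastStatus(Model, status) {
  return Model.aggregate(lastStatusPipeline(status));
}

// Usage inside an async function (Transaction is an assumed model name):
//   const docs = await findByLastStatus(Transaction, "process");
```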
// Alternatively, always insert the newest status at position 0 using the $position operator
db.col.update({
"_id": ObjectId("5b56eb3deb869312d85a8e69")
},
{
"$push": {
"transactionStatus": {
"$each": [
{
"status": "process",
"createdAt": ISODate("2018-07-24T09:02:53.347Z")
}
],
"$position": 0
}
}
}
)
// The query then only needs to check whether the first element's status is process
db.col.find(
{
"transactionStatus.0.status": "process"
}
)
See the $position and $each operators.

Mongoose array elements ALL $in array

I have an array like this:
{
"_id" : ObjectId("581b7d650949a5204e0a6e9b"),
"types" : [
{
"type" : ObjectId("581b7c645057c4602f48627f"),
"quantity" : 4,
"_id" : ObjectId("581b7d650949a5204e0a6e9e")
},
{
"type" : ObjectId("581ca0e75b1e3058521a6d8c"),
"quantity" : 4,
"_id" : ObjectId("581b7d650949a5204e0a6e9e")
}
],
"__v" : 0
},
{
"_id" : ObjectId("581b7d650949a5204e0a6e9c"),
"types" : [
{
"type" : ObjectId("581b7c645057c4602f48627f"),
"quantity" : 4,
"_id" : ObjectId("581b7d650949a5204e0a6e9e")
}
],
"__v" : 0
}
And I want to create a query that will return the elements where the array of types ALL match a $in array.
For example:
query([ObjectId("581b7c645057c4602f48627f"), ObjectId("581ca0e75b1e3058521a6d8c")])
should return elements 1 and 2
query([ObjectId("581b7c645057c4602f48627f")])
should return element 2
query([ObjectId("581ca0e75b1e3058521a6d8c")])
should return nothing
I tried
db.getCollection('elements').find({'types.type': { $in: [ObjectId("581ca0e75b1e3058521a6d8c")]}})
But it returns the elements if only one of the types matches.
You may have to use aggregation, as $in and $elemMatch match a document as soon as any one array element matches. The $project stage uses $setIsSubset to create an all-match flag, and the last stage matches documents where that flag is true.
db.getCollection('elements').aggregate([ {
$project: {
_id: 0,
isAllMatch: {$setIsSubset: ["$types.type", [ObjectId("581b7c645057c4602f48627f")]]},
data: "$$ROOT"
}
}, {
$match: {
isAllMatch: true
}
}])
Sample Output
{
"isAllMatch": true,
"data": {
"_id": ObjectId("581b7d650949a5204e0a6e9c"),
"types": [{
"type": ObjectId("581b7c645057c4602f48627f"),
"quantity": 4,
"_id": ObjectId("581b7d650949a5204e0a6e9e")
}],
"__v": 0
}
}
Alternative version:
This version combines the project and match stages into one $redact stage, with the $cond operator deciding whether to keep or prune each element.
db.getCollection('elements').aggregate([{
"$redact": {
"$cond": [{
$setIsSubset: ["$types.type", [ObjectId("581b7c645057c4602f48627f")]]
},
"$$KEEP",
"$$PRUNE"
]
}
}])
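To check the $setIsSubset logic against the three example queries from the question, here is a plain JavaScript emulation. It does not use MongoDB; the id strings stand in for ObjectIds, and the name field and the helpers allTypesIn and query are added purely for illustration.

```javascript
// Sample data from the question, with strings standing in for ObjectIds.
const A = "581b7c645057c4602f48627f";
const B = "581ca0e75b1e3058521a6d8c";
const elements = [
  { name: "element 1", types: [{ type: A, quantity: 4 }, { type: B, quantity: 4 }] },
  { name: "element 2", types: [{ type: A, quantity: 4 }] },
];

// Plain JavaScript counterpart of { $setIsSubset: ["$types.type", allowed] }.
function allTypesIn(doc, allowed) {
  const allowedSet = new Set(allowed);
  return doc.types.every((t) => allowedSet.has(t.type));
}

// Return the names of the elements whose types are ALL in the allowed list.
function query(allowed) {
  return elements.filter((doc) => allTypesIn(doc, allowed)).map((doc) => doc.name);
}

console.log(query([A, B])); // [ 'element 1', 'element 2' ]
console.log(query([A]));    // [ 'element 2' ]
console.log(query([B]));    // []
```

Note that a document with an empty types array would also pass this check, since an empty set is a subset of any set.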

Mongo Node driver how to get all fields of $max aggregate from an array of objects

I have a collection called "products" which has an array of "bids" objects.
I want to find the maximum bid for each product. For this I am aggregating products using $max on the $bids.bid_amount field. However, this only gives me the largest bid amount. How do I project all the bid fields for the maximum bid?
Here is a sample document
{
"_id" : ObjectId("58109a5138fe12215cfdc064"),
"product_id" : 2,
"item_name" : "Auction Item1",
"item_description" : "Test",
"seller_name" : "ak#gmail.com",
"item_price" : "20",
"item_quantity" : 7,
"sale_type" : "Auction",
"posted_at" : "2016:10:26 04:58:09",
"expires_at" : "2016:10:30 04:58:09",
"bids" : [
{
"bid_id" : 1,
"bidder" : "ak#gmail.com",
"bid_amount" : 300,
"bit_time" : "2016:10:26 22:36:29"
},
{
"bid_id" : 2,
"bidder" : "ak#gmail.com",
"bid_amount" : 100,
"bit_time" : "2016:10:26 22:37:29"
}
],
"orders" : [
{
"buyer" : "ak#gmail.com",
"quantity" : "2"
},
{
"buyer" : "ak#gmail.com",
"quantity" : "3"
}
]
}
Here is my mongo query:
db.products.aggregate([
{
$project: {
bidMax: { $max: "$bids.bid_amount"}
}
}
])
which gives the following result:
{
"_id" : ObjectId("58109a5138fe12215cfdc064"),
"bidMax" : 300
}
db.products.aggregate([{$unwind:"$bids"},{$group:{_id:"$_id", sum:{$sum:"$bids.bid_amount"}}},{$project:{doc:"$$ROOT", _id:1, sum:1}},{$sort:{"sum":-1}},{$limit:1}])
which returns something like { "_id" : ObjectId("5811b667c50fb1ec88227860"), "sum" : 600, doc:{your document....} }
This should do it:
db.products.aggregate([{
$unwind: '$bids'
}, {
$group: {
_id: '$product_id',
maxBid: {
$max: '$bids.bid_amount'
}
}
}])
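db.collectionName.aggregate(
[
{ $unwind: "$bids" },
{
$group:
{
_id: "$product_id",
maxBidAmount: { $max: "$bids.bid_amount" }
}
}
]
)
Use this query to get the maximum bid amount per product. The $unwind stage is needed because $max inside $group does not traverse arrays.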
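The pipelines above return only an amount. To keep every field of the winning bid, one option is to filter bids down to the entries whose bid_amount equals the maximum and take the first one. This is a sketch that assumes MongoDB 3.2+, where $filter is available and $max accepts an array expression inside $project. The plain JavaScript function highestBid is a hypothetical in-memory equivalent, included only to show the intended result.

```javascript
// Pipeline sketch: project the whole bid object that has the highest bid_amount.
const maxBidPipeline = [
  {
    $project: {
      product_id: 1,
      maxBid: {
        $arrayElemAt: [
          {
            $filter: {
              input: "$bids",
              as: "bid",
              cond: { $eq: ["$$bid.bid_amount", { $max: "$bids.bid_amount" }] },
            },
          },
          0,
        ],
      },
    },
  },
];

// In-memory equivalent: return the full bid with the largest bid_amount
// (the first one wins on ties), or null for an empty array.
function highestBid(bids) {
  if (bids.length === 0) return null;
  return bids.reduce((best, bid) => (bid.bid_amount > best.bid_amount ? bid : best));
}

// The bids from the sample document.
const sampleBids = [
  { bid_id: 1, bidder: "ak#gmail.com", bid_amount: 300, bit_time: "2016:10:26 22:36:29" },
  { bid_id: 2, bidder: "ak#gmail.com", bid_amount: 100, bit_time: "2016:10:26 22:37:29" },
];
console.log(highestBid(sampleBids).bid_id); // 1
```

In the shell the pipeline would be run as db.products.aggregate(maxBidPipeline).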

Undo Unwind in aggregate in mongodb

I have multiple documents that look something like this:
{
"_id" : ObjectId("57189fcd72b6e0480ed7a0a9"),
"venueId" : ObjectId("56ce9ead08daba400d14edc9"),
"companyId" : ObjectId("56e7d62ecc0b8fc812b2aac5"),
"cardTypeId" : ObjectId("56cea8acd82cd11004ee67a9"),
"matchData" : [
{
"matchId" : ObjectId("57175c25561d87001e666d12"),
"matchDate" : ISODate("2016-04-08T18:30:00.000Z"),
"matchTime" : "20:00:00",
"_id" : ObjectId("57189fcd72b6e0480ed7a0ab"),
"active" : 3,
"cancelled" : 0,
"produced" : 3
},
{
"matchId" : ObjectId("57175c25561d87001e666d13"),
"matchDate" : ISODate("2016-04-09T18:30:00.000Z"),
"matchTime" : "20:00:00",
"_id" : ObjectId("57189fcd72b6e0480ed7a0aa"),
"active" : null,
"cancelled" : null,
"produced" : null
}
],
"__v" : 0
}
I'm grouping by companyId and that works fine, but I want to search in matchData based on matchTime and matchId. For that purpose I $unwind matchData, and after the unwind I use my search query like this:
db.getCollection('matchWiseData').aggregate([
{"$match":{
"matchData.matchId":{"$in":[ObjectId("57175c25561d87001e666d12")]}
}},
{"$unwind":"$matchData"},
{"$match":{
"matchData.matchId":{"$in":[ObjectId("57175c25561d87001e666d12")]}}
}])
It gives me the proper result, but after applying $unwind is there any way to undo it? I'm using $unwind just to search inside the subdocument. Or is there another way to search inside a subdocument?
Well you can of course just use $push and $first in a $group to get the document back to what it was:
db.getCollection('matchWiseData').aggregate([
{ "$match":{
"matchData.matchId":{"$in":[ObjectId("57175c25561d87001e666d12")]}
}},
{ "$unwind":"$matchData"},
{ "$match":{
"matchData.matchId":{"$in":[ObjectId("57175c25561d87001e666d12")]}
}},
{ "$group": {
"_id": "$_id",
"venueId": { "$first": "$venueId" },
"companyId": { "$first": "$companyId" },
"cardTypeId": { "$first": "$cardTypeId" },
"matchData": { "$push": "$matchData" }
}}
])
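To make the unwind/regroup round trip concrete, here is a small in-memory emulation in plain JavaScript. It only mimics what $unwind, $match and $group with $first/$push do to one document; the helper names and the shortened id strings are made up for the example.

```javascript
// A cut-down version of the sample document (short strings replace ObjectIds).
const doc = {
  _id: "doc1",
  companyId: "c1",
  matchData: [
    { matchId: "m12", active: 3 },
    { matchId: "m13", active: null },
  ],
};

// Like $unwind: one output row per array element.
function unwind(d, field) {
  return d[field].map((item) => ({ ...d, [field]: item }));
}

// Like $group by _id with $first for the other fields and $push for the array.
function regroup(rows, field) {
  const byId = new Map();
  for (const row of rows) {
    if (!byId.has(row._id)) byId.set(row._id, { ...row, [field]: [] });
    byId.get(row._id)[field].push(row[field]);
  }
  return [...byId.values()];
}

const matched = unwind(doc, "matchData").filter((r) => r.matchData.matchId === "m12");
console.log(JSON.stringify(regroup(matched, "matchData")));
// [{"_id":"doc1","companyId":"c1","matchData":[{"matchId":"m12","active":3}]}]
```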
But you probably should have just used $filter with MongoDB 3.2 in the first place:
db.getCollection('matchWiseData').aggregate([
{ "$match":{
"matchData.matchId":{"$in":[ObjectId("57175c25561d87001e666d12")]}
}},
{ "$project": {
"venueId": 1,
"companyId": 1,
"cardTypeId": 1,
"matchData": {
"$filter": {
"input": "$matchData",
"as": "match",
"cond": {
"$or": [
{ "$eq": [ "$$match.matchId", ObjectId("57175c25561d87001e666d12") ] }
]
}
}
}
}}
])
And if you had at least MongoDB 2.6, you still could have used $map and $setDifference instead:
db.getCollection('matchWiseData').aggregate([
{ "$match":{
"matchData.matchId":{"$in":[ObjectId("57175c25561d87001e666d12")]}
}},
{ "$project": {
"venueId": 1,
"companyId": 1,
"cardTypeId": 1,
"matchData": {
"$setDifference": [
{ "$map": {
"input": "$matchData",
"as": "match",
"in": {
"$cond": [
{ "$or": [
{ "$eq": [ "$$match.matchId", ObjectId("57175c25561d87001e666d12") ] }
]},
"$$match",
false
]
}
}},
[false]
]
}
}}
])
That's perfectly fine when every array element already has a "unique" identifier, so the "set" operation just removes the false values from $map.
Both of those are ways to "filter" content from an array without actually using $unwind.
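The two array-filtering approaches can be compared with a plain JavaScript emulation. This is only a sketch of the semantics, not MongoDB code; the shortened id strings and variable names are invented for the example.

```javascript
// Two array elements with distinct ids, as in the sample document.
const matchData = [
  { matchId: "m12", matchTime: "20:00:00" },
  { matchId: "m13", matchTime: "20:00:00" },
];
const wanted = "m12";

// Like $filter: keep the elements for which the condition holds.
const viaFilter = matchData.filter((m) => m.matchId === wanted);

// Like $map + $setDifference: map non-matching elements to false,
// then remove the false values from the result.
const viaMap = matchData
  .map((m) => (m.matchId === wanted ? m : false))
  .filter((m) => m !== false);

console.log(JSON.stringify(viaFilter) === JSON.stringify(viaMap)); // true
```

One difference the emulation does not show: $setDifference treats its inputs as sets, so identical elements would be collapsed into one, which is why the unique identifier on each element matters.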
N.B: Not sure if you really grasp that $in is used to match a "list of conditions" rather than being required to match on arrays. So generally the condition can just be:
"matchData.matchId": ObjectId("57175c25561d87001e666d12")
Where you only actually have a single value to match on. You use $in and $or when you have a "list" of conditions. Arrays themselves make no difference to the operator required.
