Querying mongodb for dups but allow certain duplicates based on timestamps - node.js

I have a set of documents, each with a timestamp. I want Mongo to collapse the ones that are duplicates within a 3-minute window. Here is an example of what I mean:
Original Data:
[{"fruit" : "apple", "timestamp": "2014-07-17T06:45:18Z"},
{"fruit" : "apple", "timestamp": "2014-07-17T06:47:18Z"},
{"fruit" : "apple", "timestamp": "2014-07-17T06:55:18Z"}]
After querying, it would be:
[{"fruit" : "apple", "timestamp": "2014-07-17T06:45:18Z"},
{"fruit" : "apple", "timestamp": "2014-07-17T06:55:18Z"}]
Because the second entry was within the 3 min bubble created by the first entry. I've gotten the code to the point where it aggregates and removes dupes that have the same fruit, but now I only want to combine the ones that are within the timestamp bubble.

We should be able to do this! First let's split up an hour into 3-minute 'bubbles':
[0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57]
Now to group these documents we need to modify the timestamp a little. As far as I know this isn't currently possible with the aggregation framework, so instead I will use the group() method.
In order to group fruits within the same time period we need to set the timestamp to the nearest minute 'bubble'. We can do this with timestamp.minutes -= (timestamp.minutes % 3).
Here is the resulting query:
db.collection.group({
keyf: function (doc) {
var timestamp = new ISODate(doc.timestamp);
// seconds must be equal across a 'bubble'
timestamp.setUTCSeconds(0);
// round down to the nearest 3 minute 'bubble'
var remainder = timestamp.getUTCMinutes() % 3;
var bubbleMinute = timestamp.getUTCMinutes() - remainder;
timestamp.setUTCMinutes(bubbleMinute);
return { fruit: doc.fruit, 'timestamp': timestamp };
},
reduce: function (curr, result) {
result.sum += 1;
},
initial: {
sum : 0
}
});
Example results:
[
{
"fruit" : "apple",
"timestamp" : ISODate("2014-07-17T06:45:00Z"),
"sum" : 2
},
{
"fruit" : "apple",
"timestamp" : ISODate("2014-07-17T06:54:00Z"),
"sum" : 1
},
{
"fruit" : "banana",
"timestamp" : ISODate("2014-07-17T09:03:00Z"),
"sum" : 1
},
{
"fruit" : "orange",
"timestamp" : ISODate("2014-07-17T14:24:00Z"),
"sum" : 2
}
]
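The rounding done in keyf can be checked outside the database. The following is a minimal plain-JavaScript sketch, not part of the original query: floorToBubble and countByBubble are hypothetical helper names, and the input is assumed to be ISO 8601 strings as in the sample data.

```javascript
// Round an ISO 8601 timestamp down to the start of its N-minute 'bubble' (UTC).
function floorToBubble(isoString, bubbleMinutes = 3) {
  const date = new Date(isoString);
  date.setUTCSeconds(0, 0); // zero out seconds and milliseconds
  date.setUTCMinutes(date.getUTCMinutes() - (date.getUTCMinutes() % bubbleMinutes));
  return date;
}

// Count documents per (fruit, bubble) pair, mirroring the keyf/reduce pair above.
function countByBubble(docs, bubbleMinutes = 3) {
  const groups = new Map();
  for (const doc of docs) {
    const bubble = floorToBubble(doc.timestamp, bubbleMinutes).toISOString();
    const key = doc.fruit + "|" + bubble;
    const group = groups.get(key) || { fruit: doc.fruit, timestamp: bubble, sum: 0 };
    group.sum += 1;
    groups.set(key, group);
  }
  return [...groups.values()];
}

const docs = [
  { fruit: "apple", timestamp: "2014-07-17T06:45:18Z" },
  { fruit: "apple", timestamp: "2014-07-17T06:47:18Z" },
  { fruit: "apple", timestamp: "2014-07-17T06:55:18Z" },
];
console.log(countByBubble(docs));
```

Keep in mind that fixed buckets are not the same as a window that starts at the first entry: 06:47:50 and 06:48:10 land in different buckets even though they are only 20 seconds apart.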
To make this easier you could precompute the 'bubble' timestamp and insert it into the document as a separate field. The documents you create would look something like this:
[
{"fruit" : "apple", "timestamp": "2014-07-17T06:45:18Z", "bubble": "2014-07-17T06:45:00Z"},
{"fruit" : "apple", "timestamp": "2014-07-17T06:47:18Z", "bubble": "2014-07-17T06:45:00Z"},
{"fruit" : "apple", "timestamp": "2014-07-17T06:55:18Z", "bubble": "2014-07-17T06:54:00Z"}
]
Of course this takes up more storage. However, with this document structure you can use the aggregate function[0].
db.collection.aggregate(
[
{ $group: { _id: { fruit: "$fruit", bubble: "$bubble"} , sum: { $sum: 1 } } },
]
)
Hope that helps!
[0] MongoDB aggregation comparison: group(), $group and MapReduce
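On newer servers the bubble can probably be computed inside the pipeline instead of being precomputed. This is only a sketch, assuming MongoDB 5.0 or later (where $dateTrunc is available) and that timestamp is stored as an ISO 8601 string as in the sample data; the pipeline variable name is just for illustration.

```javascript
// Aggregation pipeline that derives the 3-minute bubble on the fly.
// Assumes MongoDB 5.0+ ($dateTrunc) and string timestamps ($toDate converts them).
const pipeline = [
  {
    $group: {
      _id: {
        fruit: "$fruit",
        bubble: {
          $dateTrunc: { date: { $toDate: "$timestamp" }, unit: "minute", binSize: 3 },
        },
      },
      sum: { $sum: 1 },
    },
  },
];

// In the mongo shell this would be run as: db.collection.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```

If timestamp is already stored as a Date, the $toDate wrapper should not be needed.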

Related

Display the documents with partial matching of field value in mongodb

The below is my movie schema:
"_id" : ObjectId("59b9501600fcb397d6acd5bb"),
"theatreid" : 2,
"name" : "carnival cinemas",
"location" : "kanjulmarg",
"address" : "sec 2,kanjul, Mumbai, Maharashtra 400703",
"shows" : [
{
"mname" : "bareily ki barfi",
"timings" : [
10,
13,
14,
16,
22
]
},
{
"mname" : "Toilet:ek prem katha",
"timings" : [
8,
9,
14,
16,
20,
23
]
}
]
"_id" : ObjectId("59b9506500fcb397d6acd5bc"),
"theatreid" : 3,
"name" : "pheonix pvr",
"location" : "kurla",
"address" : "sec 26,kurla, Mumbai, Maharashtra 400701",
"shows" : [
{
"mname" : "shubh mangal savdhan",
"timings" : [
9,
11,
15,
18,
20
]
},
{
"mname" : "Toilet:ek prem katha",
"timings" : [
8,
9,
14,
16,
20,
23
]
}
]
My query is to display all the theaters having a show timing (shows.timings) greater than 20 where the location is kurla. I am using Node.js.
The query is as follows:
user.aggregate([{$match:{"location":"kurla"}},{"$addFields": {"shows": {"$map": {"input": "$shows","as": "resultm","in": {"name": "$$resultm.name","mname": "$$resultm.mname","timings": {"$filter": {"input": "$$resultm.timings","as": "resultf","cond": {"$gte": ["$$resultf",10]}}}}}}}}])
This works fine when it gets an exact match where location="kurla", but I want it to also work and display the same records even if location=kurla east,maharashthra,mumbai. That is, even if only a partial match is found in the location attribute of my collection. How can this be done? Please help. Thanks :)
Use the $regex operator to apply a partial text match on the location attribute.
Replace this ...
{$match:{"location":"kurla"}}
... with this:
{ $match: {location: { $regex: /^kurla/ } } }
In the above example the condition is location like 'kurla%'.
More details in the docs.
This presumes that the values you wish to filter on are all in the location field, which is consistent with the following statement from your question:
location=kurla east,maharashthra,mumbai
There are no examples of "maharashthra" or "mumbai" in the location field in the example documents you showed. However, "Maharashtra" does appear in the address attribute, so if your filter is actually meant to apply to location and/or address, then the solution will be different from what I have posted above. If this is the case, perhaps you could update your question to clarify.
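To see what the anchored pattern does and does not match, it can be tried against a few strings in plain JavaScript. This is only a sketch: the sample strings and the escapeRegex helper are made up for illustration, and MongoDB evaluates $regex with its own PCRE-based engine, although simple patterns like these should behave the same way.

```javascript
// Escape user input so it is matched literally inside a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Made-up sample values for the location field.
const locations = ["kurla", "kurla east", "Kurla West", "east kurla"];

const prefix = /^kurla/;   // like 'kurla%', case-sensitive
const anywhere = /kurla/i; // like '%kurla%', ignoring case

const prefixMatches = locations.filter((l) => prefix.test(l));
const anywhereMatches = locations.filter((l) => anywhere.test(l));
console.log(prefixMatches);
console.log(anywhereMatches);

// Building the $match stage from user input (userInput is a hypothetical variable).
const userInput = "kurla";
const matchStage = {
  $match: { location: { $regex: "^" + escapeRegex(userInput), $options: "i" } },
};
console.log(JSON.stringify(matchStage));
```

The anchored pattern only matches values that start with "kurla"; dropping the ^ matches the text anywhere in the field, at the cost of not being able to use an index efficiently.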

How to extract grouped results from an array inside a collection in Mongodb

I am working with the Foursquare API using NodeJS and Mongodb on the backend side. I have all the user information and checkin history stored in a collection. So the collection looks similar to this:
{
_id: ...,
foursquareId: ...
personalInfo: {},
checkins: [
{
id: ...,
createdAt: 123456789, // seconds since epoch
venue: {},
...
},
{
id: ...,
createdAt: 123456789, // seconds since epoch
venue: {},
...
},
...
]
}
For this question I am only interested in the checkins array. I need to return the number of checkins per month and year, but I am not sure what the best way to approach this is. I think the result would look something like this (I am not totally convinced, though):
{
'2016': {
'January': 43,
'February': 38,
'March': 40,
'April': 48,
'May': 50,
'June': 41,
'July': 39,
'August': 38,
'September': 30,
'October': 29,
'November': 38,
'December': 41
},
'2017': {
'January': 55,
'February': 20
}
}
I am not concerned about the format in which I receive the information on the frontend. I want to know if it is possible to do this in MongoDB, because I couldn't find a way to do it in their documentation or in any other example here. Otherwise I might need to do it on the frontend (not a good idea, since I could have around 7k results or more in this array).
Using the aggregation framework should get you what you want.
db.collectionName.aggregate([
{$unwind:'$checkins'},
{
$project: {
id: 1,
'checkins.createdAt' : 1,
newDate : {
$add : [ new Date(0), {
$multiply : [ "$checkins.createdAt", 1000 ]
}]
}
}
},
{$project : {
year: {$year: "$newDate"},
month: {$month: "$newDate"}
}},
{$group: {_id:{year:"$year", month:"$month"}, count:{$sum:1}}},
{$group: {_id:{year:"$_id.year"}, monthTotals: { $push: { month: "$_id.month", count: "$count" } }}}
])
This produces documents like the following:
{
"_id" : {
"year" : NumberInt(2016)
},
"monthTotals" : [
{"month" : NumberInt(1),"count" : NumberInt(2)},
{"month" : NumberInt(2),"count" : NumberInt(3)}
]
}
The second step (first $project step) may need to be adjusted depending on how your date since epoch value is stored, but this should get you generally what you need.
There's not a way to get the data exactly as you've outlined without some post processing of the results, but it should be simple enough to modify the result.
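The post-processing could look something like the following plain-JavaScript sketch. The reshapeMonthTotals name is hypothetical, and the input is assumed to have the shape shown above with NumberInt values arriving as ordinary numbers in the driver.

```javascript
const MONTH_NAMES = [
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December",
];

// Turn [{ _id: { year }, monthTotals: [{ month, count }] }]
// into { year: { MonthName: count } }.
function reshapeMonthTotals(results) {
  const byYear = {};
  for (const doc of results) {
    const months = {};
    // $push does not promise calendar order, so sort by month number first
    const sorted = [...doc.monthTotals].sort((a, b) => a.month - b.month);
    for (const { month, count } of sorted) {
      months[MONTH_NAMES[month - 1]] = count; // $month is 1-based
    }
    byYear[doc._id.year] = months;
  }
  return byYear;
}

const aggregationOutput = [
  { _id: { year: 2016 }, monthTotals: [{ month: 2, count: 3 }, { month: 1, count: 2 }] },
];
const reshaped = reshapeMonthTotals(aggregationOutput);
console.log(JSON.stringify(reshaped));
```

Months with no checkins simply do not appear in the result; fill them with 0 on the client if the frontend expects all twelve.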

MongoDB use position in sorted query result to compute field

I have a Mongoose Model for users. Each user has a certain amount of points. I'd like to create a field that is the users rank where:
rank = user position sorted by points / total users
Let's suppose the user model looks like this:
{
'name': 'bob',
'points': 15,
'rank': 9/15,
}
(I realize that the fraction would really be a decimal when stored).
Is there a way that I can update all of these users by:
1) Sorting them by points
2) Get a user's index in this sorted list
3) Divide that index by the total number of items in the list
I'm not sure what kind of mongo operators are out there for finding a doc's position in query results and for finding the total size of the query results.
Storing the rank, as the other answer does, is not a good idea: it requires recalculating every rank after each update of a points value.
MongoDB 5.0 introduced the $rank window operator, used inside $setWindowFields:
db.users.aggregate([
{
$setWindowFields: {
sortBy: { points: 1 },
output: {
rank: {
$rank: {}
}
}
}
}
])
will output
{ "points": 140, "rank": 1 },
{ "points": 160, "rank": 2 },
{ "points": 170, "rank": 3 },
{ "points": 180, "rank": 4 },
{ "points": 220, "rank": 5 }
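The question asked for position divided by the total number of users. A possible extension is sketched below, assuming MongoDB 5.0+ and that $count is usable as a window operator over the whole partition; the position and total field names are made up, and sorting in descending order (highest score gets position 1) is an assumption about what is wanted.

```javascript
// Pipeline sketch: rank position and total count in one pass, then divide.
const rankPipeline = [
  {
    $setWindowFields: {
      sortBy: { points: -1 }, // highest points first (assumed direction)
      output: {
        position: { $rank: {} },
        // count every document in the partition, not just a sliding window
        total: { $count: {}, window: { documents: ["unbounded", "unbounded"] } },
      },
    },
  },
  { $set: { rank: { $divide: ["$position", "$total"] } } },
  { $unset: ["position", "total"] }, // drop the helper fields
];

// In the mongo shell this would be run as: db.users.aggregate(rankPipeline)
console.log(JSON.stringify(rankPipeline, null, 2));
```

Note that $rank gives tied documents the same position and leaves gaps after them, so users with equal points end up with equal fractions.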
You can do this using a couple of queries and a bit of JavaScript. Expanding on the steps you outlined, what you need to do is:
1) Find all of the user documents, sort them by points in descending order and assign the results to a cursor. You might want to ensure that you have an index on this field to make this query run faster.
2) Get the count for the number of documents returned.
3) Keep track of the position of the document within the results using an index.
4) Iterate through the documents, calculating the rank using the count and the index, and updating the corresponding user's rank with the result of that calculation.
In the mongo shell, the code would look something like the following.
var c = db.user.find().sort({ "points": -1 });
var count = c.count();
var i = 1;
while (c.hasNext()) {
var rank = i / count;
var user = c.next();
db.user.update(
{ "_id": user._id },
{ "$set": { "rank": rank } }
);
i++;
}
So if you had the following three users in your collection:
{
"_id" : ObjectId("54f0af63cfb269d664de0b4e"),
"name" : "bob",
"points" : 15,
"rank" : 0
}
{
"_id" : ObjectId("54f0af7fcfb269d664de0b4f"),
"name" : "arnold",
"points" : 20,
"rank" : 0
}
{
"_id" : ObjectId("54f0af95cfb269d664de0b50"),
"name" : "claus",
"points" : 10,
"rank" : 0
}
After the update their documents would look like this:
{
"_id" : ObjectId("54f0af63cfb269d664de0b4e"),
"name" : "bob",
"points" : 15,
"rank" : 0.6666666666666666
}
{
"_id" : ObjectId("54f0af7fcfb269d664de0b4f"),
"name" : "arnold",
"points" : 20,
"rank" : 0.3333333333333333
}
{
"_id" : ObjectId("54f0af95cfb269d664de0b50"),
"name" : "claus",
"points" : 10,
"rank" : 1
}
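The arithmetic in that loop can be verified without a database. This is a minimal sketch in plain JavaScript; computeRanks is a hypothetical helper, and the three users are the ones from the example above with the _id fields left out.

```javascript
// rank = position in the list sorted by points (highest first) / total users
function computeRanks(users) {
  const sorted = [...users].sort((a, b) => b.points - a.points);
  return sorted.map((user, index) => ({
    ...user,
    rank: (index + 1) / sorted.length, // positions start at 1, as in the shell loop
  }));
}

const users = [
  { name: "bob", points: 15, rank: 0 },
  { name: "arnold", points: 20, rank: 0 },
  { name: "claus", points: 10, rank: 0 },
];
const ranked = computeRanks(users);
console.log(ranked);
```

For a large collection, one update per document gets slow; batching the updates (for example with bulkWrite) is probably worth considering.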

MongoDB - get documents with max attribute per group in a collection

My data looks like this:
session, age, firstName, lastName
1, 28, John, Doe
1, 21, Donna, Keren
2, 32, Jenna, Haze
2, 52, Tommy, Lee
..
..
I'd like to get the row with the largest age in each session. So for the above input my output would look like:
sessionid, age, firstName, lastName
1, 28, John, Doe
2, 52, Tommy, Lee
because John has the largest age in the session = 1 group and Tommy has the largest age on the session=2 group.
I need to export the result to a file (csv) and it may contain lots of records.
How can I achieve this?
MongoDB aggregation offers the $max operator, but in your case you want the "whole" record as it is. So the appropriate thing to do here is $sort and then use the $first operator within a $group statement:
db.collection.aggregate([
{ "$sort": { "session": 1, "age": -1 } },
{ "$group": {
"_id": "$session",
"age": { "$first": "$age" },
"firstName": { "$first": "$firstName" },
"lastName": { "$first": "$lastName" }
}}
])
So the "sorting" gets the order right, and the "grouping" picks the first occurrence within the "grouping" key where those fields exist.
$first is used here because the $sort puts the largest age first within each session. With an ascending sort on age you could use $last instead.
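The same sort-then-take-first idea, plus the CSV step from the question, can be sketched in plain JavaScript. The oldestPerSession and toCsv names are hypothetical, and the CSV writer assumes the values contain no commas, quotes or newlines.

```javascript
// Sort by session ascending and age descending, then keep the first row per session.
function oldestPerSession(rows) {
  const sorted = [...rows].sort((a, b) => a.session - b.session || b.age - a.age);
  const seen = new Set();
  const result = [];
  for (const row of sorted) {
    if (!seen.has(row.session)) {
      seen.add(row.session);
      result.push(row);
    }
  }
  return result;
}

// Serialize rows as CSV (no quoting or escaping is done here).
function toCsv(rows) {
  const header = "session,age,firstName,lastName";
  const lines = rows.map((r) => [r.session, r.age, r.firstName, r.lastName].join(","));
  return [header, ...lines].join("\n");
}

const rows = [
  { session: 1, age: 28, firstName: "John", lastName: "Doe" },
  { session: 1, age: 21, firstName: "Donna", lastName: "Keren" },
  { session: 2, age: 32, firstName: "Jenna", lastName: "Haze" },
  { session: 2, age: 52, firstName: "Tommy", lastName: "Lee" },
];
const csv = toCsv(oldestPerSession(rows));
console.log(csv);
```

For lots of records it is better to let the server do the grouping with the pipeline above and only stream the results into the CSV writer.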
You could try the below aggregation query which uses max attribute: http://docs.mongodb.org/manual/reference/operator/aggregation/max/
db.collection.aggregate([
{ $group: {
"_id": "$session",
"age": { $max: "$age" }
} },
{ $out : "max_age" }
])
The results are written to the new collection max_age, which you can then dump into a CSV.
Note: it will give only the session and max age and will not return other fields.

Calculating count and average with MongoDB aggregation

I have a simple db layout like this:
client
id
sex (male/female)
birthday (date)
client
id
sex (male/female)
birthday (date)
(...)
I'm trying to write an aggregation command that outputs how many male and female clients I've got, and I'd also like to output the average age of males and females. I'm not sure whether I can do this in the same command or need 2 separate ones.
// Count of males/females, average age
Clients.aggregate({
$project : {"sex" : 1,
"sexCount" : 1,
"birthday" : 1,
"avgAge" : 1
}
},
{
$match: {"sex": {$exists: true}}
},
{
$group: {
_id : "$sex",
sexCount : { $sum: 1 },
avgAge : { $avg: "$birthday" },
}
},
{ $sort: { _id: 1 } }
, function(err, sex_dbres) {
if (err)
throw err;
else{
(...)
}
});
With the code above I get the counts of male/female, but avgAge comes as 0. Any ideas?
Many thanks
The answer would be much simpler if you were storing age in the original document (as Dmitry posted, you could just do a straight avgAge:{$avg:"$age"} in your $group step).
Aggregation Framework is pretty nifty though and has many cool operators which allow you to compute this missing age field "on the fly".
I'm going to store each step of the aggregation in a variable so it's easier to see what's going on:
today = new Date();
// split today and bday into numerical year and numerical day-of-the-year
project1= {
"$project" : {
"sex" : 1,
"todayYear" : {
"$year" : today
},
"todayDay" : {
"$dayOfYear" : today
},
"by" : {
"$year" : "$bday"
},
"bd" : {
"$dayOfYear" : "$bday"
}
}
};
// calculate age in days by subtracting bday in days from today in days
project2 = {
"$project" : {
"sex" : 1,
"age" : {
"$subtract" : [
{
"$add" : [
{
"$multiply" : [
"$todayYear",
365
]
},
"$todayDay"
]
},
{
"$add" : [
{
"$multiply" : [
"$by",
365
]
},
"$bd"
]
}
]
}
}
};
// sum up for each sex the count and compute avg age (in days)
group = {
"$group" : {
"_id" : "$sex",
"total" : {
"$sum" : 1
},
"avgAge" : {
"$avg" : "$age"
}
}
};
// divide days by 365 to get age in years.
project3 = {
"$project" : {
"_id" : 0,
"sex" : "$_id",
"total" : 1,
"averageAge" : {
"$divide" : [
"$avgAge",
365
]
}
}
};
Now you can run the aggregation:
> db.client.find({},{_id:0})
{ "sex" : "male", "bday" : ISODate("2000-02-02T08:00:00Z") }
{ "sex" : "male", "bday" : ISODate("1987-02-02T08:00:00Z") }
{ "sex" : "female", "bday" : ISODate("1989-02-02T08:00:00Z") }
{ "sex" : "female", "bday" : ISODate("1993-11-02T08:00:00Z") }
> db.client.aggregate([ project1, project2, group, project3 ])
{
"result" : [
{
"sex" : "female",
"total" : 2,
"averageAge" : 21.34109589041096
},
{
"sex" : "male",
"total" : 2,
"averageAge" : 19.215068493150685
}
],
"ok" : 1
}
>
The reason this is not simple is that the Aggregation Framework currently does not support direct subtraction of dates. Please vote for https://jira.mongodb.org/browse/SERVER-6239, which is targeted for the next major release. Once it's implemented, it should allow subtraction of dates directly (though you will still need to convert the result to the appropriate granularity, probably years in this case).
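To see what the project1/project2 arithmetic actually computes, it can be reproduced in plain JavaScript. This is only a sketch: dayOfYear and approxAgeInDays are hypothetical helpers, everything is done in UTC, and the "today" value is fixed so the result is reproducible.

```javascript
// Day of the year (1-366) in UTC, playing the role of the $dayOfYear operator.
function dayOfYear(date) {
  const startOfYear = Date.UTC(date.getUTCFullYear(), 0, 1);
  const startOfDay = Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate());
  return (startOfDay - startOfYear) / 86400000 + 1;
}

// Approximate age in days: difference of (year * 365 + dayOfYear), ignoring leap days.
function approxAgeInDays(bday, today) {
  const todayDays = today.getUTCFullYear() * 365 + dayOfYear(today);
  const bdayDays = bday.getUTCFullYear() * 365 + dayOfYear(bday);
  return todayDays - bdayDays;
}

const bday = new Date("2000-02-02T08:00:00Z");
const today = new Date("2013-02-02T08:00:00Z"); // fixed date chosen for illustration
const ageInYears = approxAgeInDays(bday, today) / 365;
console.log(ageInYears);
```

Because every year is treated as 365 days, the result is a close approximation rather than an exact calendar age.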
The date object can't be "averaged", but numbers can. You can convert your dates to the timestamp value, and then find average from it. But still that won't be an average age, you'll need to subtract result from the current date outside of the aggregation function.
Another option is to assume that age can be calculated using only year part of the date (that is, if I was born on December 1, 2000, in today's report I'll be 12 years old, not 11). In this case you can use date operators to extract year value.
{
$project : {"sex" : 1,
"sexCount" : 1,
"year" : {$year: "$birthday"}
}
},
{
$project : {"sex" : 1,
"sexCount" : 1,
"age" : {$subtract: [2012, '$year']}
}
},
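The difference between the year-only age and the age in completed years can be shown with the December 1, 2000 example. This is a plain-JavaScript sketch; yearOnlyAge and exactAge are hypothetical helpers, and the report date is an arbitrary day in 2012 chosen for illustration.

```javascript
// Age from the year part only, as in {$subtract: [2012, '$year']}.
function yearOnlyAge(birthday, reportYear) {
  return reportYear - birthday.getUTCFullYear();
}

// Age in completed years on a given date (UTC).
function exactAge(birthday, onDate) {
  let age = onDate.getUTCFullYear() - birthday.getUTCFullYear();
  const hadBirthdayThisYear =
    onDate.getUTCMonth() > birthday.getUTCMonth() ||
    (onDate.getUTCMonth() === birthday.getUTCMonth() &&
      onDate.getUTCDate() >= birthday.getUTCDate());
  if (!hadBirthdayThisYear) age -= 1;
  return age;
}

const birthday = new Date("2000-12-01T00:00:00Z");
const reportDate = new Date("2012-06-15T00:00:00Z"); // arbitrary date in 2012
console.log(yearOnlyAge(birthday, 2012));   // 12
console.log(exactAge(birthday, reportDate)); // 11
```

The year-only version overstates the age by one for anyone whose birthday has not yet occurred in the report year, which may or may not matter for an average.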
