How to find documents with unique records in MongoDB? - node.js

I have a collection with several documents in it: jobs to be processed by another system. I look up 5 of them from this collection like this:
Work.find({ status: 'new' })
    .sort({ priority: 'desc' })
    .limit(5)
    .exec(function (err, work) { /* ... */ });
There is another field on these documents which determines that only one job with a given unique value can be run at the same time.
For example these 3 jobs:
{uniqueVal: 1, priority: 1, type: "scrape", status:"new"}
{uniqueVal: 1, priority: 1, type: "scrape", status:"new"}
{uniqueVal: 2, priority: 1, type: "scrape", status:"new"}
There are 2 records with a uniqueVal of 1. Is there anything I can do to pull only one record for that value?
Note: These values are not predetermined values like in the example, they are ObjectIds of other documents.
I've looked into distinct(), but it seems to return only the unique values themselves, not the documents.

I think the best choice is to use aggregation.
You can $group by uniqueVal:
http://docs.mongodb.org/manual/reference/operator/aggregation/group/#pipe._S_group
and use $first for the other values:
http://docs.mongodb.org/manual/reference/operator/aggregation/first/#grp._S_first
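A minimal sketch of that pipeline with Mongoose (untested; it assumes the fields shown in the question):
Work.aggregate([
    { $match: { status: 'new' } },
    // sort before grouping so $first picks the highest-priority job
    { $sort: { priority: -1 } },
    { $group: {
        _id: '$uniqueVal',                 // one result per uniqueVal
        priority: { $first: '$priority' },
        type: { $first: '$type' },
        status: { $first: '$status' }
    } },
    { $sort: { priority: -1 } },
    { $limit: 5 }
], function (err, work) {
    // work holds at most 5 jobs, each with a distinct uniqueVal
});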

Related

MongoDB sort by custom calculation in Node.JS mongodb driver

I'm using the Node.js MongoDB driver. I have a collection of job listings with a salary and a number of vacancies, and I want to sort them by one rule: if either the salary or the number of vacancies is greater, the listing gets higher priority in the sort. I came up with this simple formula:
( salary / 100 ) + num_of_vacancies
e.g. top priority:
{ salary: 5000, num_of_vacancies: 500 }  // value is 550
{ salary: 50000, num_of_vacancies: 2 }   // value is 502
and lower priority:
{ salary: 5000, num_of_vacancies: 2 }    // value is 52
But my problem is: as far as I know, MongoDB's sort only takes a property to sort by and an ascending or descending direction. How do I sort by a custom expression?
The data in MongoDB looks like this (not the full version):
{
    title: "job title",
    description: "job description",
    salary: 5000,
    num_of_vacancies: 50
}
This is just one option; adjust it for your MongoDB driver.
With $addFields we create the field to sort on, named toSortLater purely for semantic clarity.
Then we add a $sort stage and sort with the highest values first. Change -1 to 1 for the opposite behaviour.
db.collection.aggregate([
    { $addFields: {
        toSortLater: {
            $add: [
                { $divide: ["$salary", 100] },
                "$num_of_vacancies"
            ]
        }
    } },
    { $sort: { "toSortLater": -1 } }
])
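Adapted for the Node.js driver it might look like this (a sketch; the collection name jobs is assumed, and $addFields requires MongoDB 3.4+):
db.collection('jobs').aggregate([
    { $addFields: {
        toSortLater: {
            $add: [
                { $divide: ['$salary', 100] },
                '$num_of_vacancies'
            ]
        }
    } },
    { $sort: { toSortLater: -1 } }
]).toArray(function (err, docs) {
    // docs are sorted by (salary / 100) + num_of_vacancies, highest first
});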

MongoDB - too much data for sort() with no index. Full collection

I'm using Mongoose for Node.js to interface with the mongo driver, so my query looks like:
db.Deal
    .find({})
    .select({
        _id: 1,
        name: 1,
        opp: 1,
        dateUploaded: 1,
        status: 1
    })
    .sort({ dateUploaded: -1 })
And get: too much data for sort() with no index. add an index or specify a smaller limit
The number of documents in the Deal collection is quite small, maybe ~500, but each one contains many embedded documents. The fields returned in the query above are all primitive, i.e. not documents.
I currently don't have any indexes setup other than the default ones - I've never had any issue until now. Should I try adding a compound key on:
{ _id: 1, name: 1, opp: 1, status: 1, dateUploaded: -1 }
Or is there a smarter way to perform the query? First time using mongodb.
From the MongoDB documentation on limits and thresholds:
MongoDB will only return sorted results on fields without an index if the combined size of all documents in the sort operation, plus a small overhead, is less than 32 megabytes.
The embedded documents probably push the total past that limit; you should add an index on the sort field dateUploaded if you want to keep running the same query.
Otherwise you can limit your query and start paginating the results.
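For example, a sketch with Mongoose (assuming the schema is called dealSchema; in the shell, db.deals.createIndex({ dateUploaded: -1 }) does the same):
// A single-field index can be traversed in either direction,
// so { dateUploaded: -1 } also serves ascending sorts.
dealSchema.index({ dateUploaded: -1 });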

How to speed up MongoDB count() Queries?

My collection is described as follows:
{ "_id" : ObjectId("5474af69d4b28042fb63b856"), "name" : "XXXX", "action" : "accept", "source" : "127.0.0.1", "srcport" : "80", "destination" : "192.168.0.13", "dstport" : "53213", "service" : "443", "service_id" : "https", "unixtime" : NumberLong("1412774569000"), "segment" : "MySegment", "direction" : "INCOMING", "location" : "US" }
I currently have ~5.5 million entries in my collection and the base query is always:
collection.count({"action":"2_different_action_types", "direction":"3_different_directions", "unixtime": {"$gte": 1412774000000, "$lte": 1412774900000}})
action, direction and unixtime always exist in my query, but their values are dynamic. Optional parameters (also with dynamic values) are:
location
segment
service_id
For example:
collection.count({"action":"2_different_action_types", "direction":"3_different_directions", "location":"US","segment":"mySegment", "unixtime": {"$gte": 1412774000000, "$lte": 1412774900000}})
collection.count({"action":"2_different_action_types", "direction":"3_different_directions", "service_id":"https", "unixtime": {"$gte": 1412774000000, "$lte": 1412774500000}})
I created the following indexes:
db.collection.createIndex({ unixtime: 1, action: 1, direction: 1 })
db.collection.createIndex({ unixtime: 1, action: 1, direction: 1, location: 1 })
db.collection.createIndex({ unixtime: 1, action: 1, direction: 1, service_id: 1 })
db.collection.createIndex({ unixtime: 1, action: 1, direction: 1, segment: 1 })
db.collection.createIndex({ unixtime: 1, action: 1, direction: 1, location: 1, service_id: 1 })
db.collection.createIndex({ unixtime: 1, action: 1, direction: 1, location: 1, segment: 1 })
My query without an index took ~8 s; with an index it takes ~6 s, which is still quite slow.
How can I speed up the whole thing? Note that at the moment I'm just counting the matches, not looking for specific entries.
Additional Info:
I'm currently trying to optimize these queries directly in the mongo shell, but in the end I'll be querying via Node.js (I don't know if this is relevant to the solution).
The indexes don't seem to make much sense this way. Range queries like $gte and $lte should come last, not only in the query but also in the index. Putting unixtime at position 1 in the index is generally a bad idea (unless you need the set of distinct actions within a single second and the number of actions in a single second is so large that they need indexing, which is unlikely).
Try reversing the indexes, and make sure the order of the fields in the index matches the order in the query.
If location, segment and service_id have low selectivity, try without an index on these fields first. More indexes cost more RAM and slow down inserts and updates, but with low selectivity the gain for queries is sometimes negligible. In the query, it might make sense to put the optional fields last, after all the other criteria: if the candidate set is small enough after the required criteria and the unixtime interval, a collection scan of the remaining items shouldn't hurt performance too badly. If it does, and the selectivity is high, move them further forward.
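For example, following that equality-first, range-last ordering, the base index from the question would become something like:
// equality conditions (action, direction) first, the unixtime range last
db.collection.createIndex({ action: 1, direction: 1, unixtime: 1 })
// an optional low-cardinality field, if indexed at all, also goes before the range
db.collection.createIndex({ action: 1, direction: 1, service_id: 1, unixtime: 1 })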

Increase performance for this MongoDB query

I have a MongoDB document with quite a large embedded array:
name : "my-dataset"
data : [
{country : "A", province: "B", year : 1990, value: 200}
... 150 000 more
]
A) Let us say I want to return data objects where country == "A". What is the proper way of doing this, for example via Node.js?
B) Given 150,000 entries with 200 matches, approximately how long should the query take?
C) Would it be better (performance/structure-wise) to store the data as documents, with the name as a property of each document?
D) Would it be more efficient to use MySQL for this?
A) Just find them with a query.
B) If the compound index { name: 1, "data.country": 1 } is built, the query should be fast. But since you store all the data in one array, the $unwind operator has to be used, and as a result the query could be slow.
C) It will be better. If you store the data like:
{country : "A", province: "B", year : 1990, value: 200, name:"my-dataset"}
{country : "B", province: "B", year : 1990, value: 200, name:"my-dataset"}
...
With a compound index { name: 1, country: 1 }, the query time should be < 10 ms.
D) MySQL vs MongoDB 1000 reads
1. You can use the MongoDB aggregation framework:
db.collection.aggregate([
    { $match: { name: "my-dataset" } },
    { $unwind: "$data" },
    { $match: { "data.country": "A" } }
])
This will return a document for each data entry where the country is "A". If you want to regroup the datasets, add a $group stage:
db.collection.aggregate([
    { $match: { name: "my-dataset" } },
    { $unwind: "$data" },
    { $match: { "data.country": "A" } },
    { $group: { _id: "$_id", data: { $addToSet: "$data" } } }
])
(I didn't test this on a proper dataset, so it might be buggy.)
2. 150,000 subdocuments is still not a lot for MongoDB, so if you're only querying one dataset it should be pretty fast (on the order of milliseconds).
3. As long as you are sure your document will stay smaller than 16 MB, the maximum BSON document size (which is kind of hard to guarantee), it should be fine. But the queries would be simpler if you stored your data as documents with the dataset name as a property, which is generally better for performance.
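A sketch of that flattened layout in the shell (the collection name entries is hypothetical):
// one document per data entry, with the dataset name denormalized onto it
db.entries.insert({ country: "A", province: "B", year: 1990, value: 200, name: "my-dataset" })
// the compound index from answer C makes the lookup an index scan
db.entries.createIndex({ name: 1, country: 1 })
db.entries.find({ name: "my-dataset", country: "A" })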

Index multiple MongoDB fields, make only one unique

I've got a MongoDB database of metadata for about 300,000 photos. Each has a native unique ID that needs to stay unique to protect against duplicate insertions. Each also has a timestamp.
I frequently need to run aggregate queries to see how many photos I have for each day, so I also have a date field in the format YYYY-MM-DD. This is obviously not unique.
Right now I only have an index on the id property, like so (using the Node driver):
collection.ensureIndex(
    { id: 1 },
    { unique: true, dropDups: true },
    function (err, indexName) { /* etc etc */ }
);
The group query for getting the photos by date takes quite a long time, as one can imagine:
collection.group(
    { date: 1 },
    {},
    { count: 0 },
    function (curr, result) {
        result.count++;
    },
    function (err, grouped) { /* etc etc */ }
);
I've read through the indexing strategy docs, and I think I need to index the date property as well. But I don't want to make it unique, of course (though I suppose it would be fine to make it unique in combination with the unique id). Should I create a regular compound index, or can I chain .ensureIndex() calls and specify uniqueness only for the id field?
MongoDB does not have "mixed"-type indexes that can be partially unique. On the other hand, why don't you use _id instead of your id field, if possible? It's already indexed and unique by definition, so it prevents you from inserting duplicates.
Mongo can use only a single index per query clause, which is important to consider when creating indexes. For this particular query and these requirements, I would suggest a separate unique index on the id field, which you get automatically if you use _id. Additionally, you can create a non-unique index on the date field only. If you run a query like this:
db.collection.find({ "date": "2013-01-02" }).count();
Mongo will be able to answer the query from the index alone (a covered query), which is the best performance you can get.
Note that Mongo won't be able to use a compound index on (id, date) if you are searching by date only. Your query has to match an index prefix, i.e. if you search by id then the (id, date) index can be used.
Another option is to pre-aggregate in the schema itself: whenever you insert a photo, increment a counter for its day. That way you don't need to run any aggregation jobs at all. You can also run some tests to determine whether this approach is more performant than aggregation.
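A sketch of that pre-aggregation with the Node driver (the counts collection name is hypothetical):
// on every photo insert, upsert a per-day counter document
counts.update(
    { _id: photo.date },            // e.g. "2013-01-02"
    { $inc: { count: 1 } },
    { upsert: true },
    function (err, result) { /* etc etc */ }
);
// reading the daily counts is then a plain find() instead of group()
counts.find().toArray(function (err, dailyCounts) { /* etc etc */ });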
