Compound index query - node.js

I've got a query that is pretty slow, and I'm trying to put a compound index on it correctly (total newbie here). I've created the indexes below and am wondering whether I've done it correctly; they don't seem to have made any difference at all =/
Match.statics.getMatchesForDay = function (day, liveOnly, excludedAreas, excludedCompetitions, doneCallback) {
var include = "-_id match_id title date_utc date_iso _grouping match_info.period match_info.minute match_info.minute_extra match_info.full_time team_a team_b status time_utc";
var filter = {
date_utc: day,
score_coverage: true,
"_grouping._area.area_id": {
// Only include active areas
"$nin": excludedAreas
},
"_grouping._competition.competition_id": {
// Only include active competitions
"$nin": excludedCompetitions
}
};
if (liveOnly)
filter.status = "Playing";
this.find(filter).sort({
"_grouping._area.name": 1, //Sort by asc
"_grouping._competition.competition_id": 1, //Sort by asc
"date_iso": 1, //Sort by asc
}).select(include).exec(doneCallback);
};
Match.index({
date_utc: -1,
score_coverage: -1,
"_grouping._area.area_id": 1,
"_grouping._competition.competition_id": 1
});
Match.index({"_grouping._area.area_id": 1, "_grouping._competition.competition_id": 1, date_iso: 1});
My .explain() output:
{
"cursor" : "BtreeCursor date_utc_1_score_coverage_-1__grouping._area.area_id_1__grouping._competition.competition_id_1",
"isMultiKey" : false,
"n" : 358,
"nscannedObjects" : 358,
"nscanned" : 358,
"nscannedObjectsAllPlans" : 863,
"nscannedAllPlans" : 863,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 6,
"nChunkSkips" : 0,
"millis" : 7,
"indexBounds" : {
"date_utc" : [
[
"2015-03-15",
"2015-03-15"
]
],
"score_coverage" : [
[
true,
true
]
],
"_grouping._area.area_id" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"_grouping._competition.competition_id" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "xxxxx:27017",
"filterSet" : false
}
Unless it is working and I'm being picky: on a tiny collection the query takes around 200ms, on the larger ones 1-3s+. At the moment, the $nin is empty on both.
Thanks for looking.

That query won't be fast no matter what index you use. Let's look at the filter step-by-step:
var filter = {
date_utc: day,
Fine so far. An equality query can be indexed. Dates should have high enough selectivity. It's also likely to provide good data-locality.
score_coverage: true,
A boolean? That's bad: indexes are essentially tree structures. If your datum is a single boolean, however, there are only two possible values, true and false (low selectivity). This means that the tree has one true and one false branch, each containing all the data 'below' it, which essentially turns the tree into a linked list. It also destroys locality, because changing the value will have to rearrange entire subtrees. Move this to the end of the filter, and remove it from the index.
"_grouping._area.area_id": {
// Only include active areas
"$nin": excludedAreas
},
Indexes work like the letter marks in phone books. You're looking for "john doe"? Fine, look up the letter "D" visible from the outside (the index), then search for the "o" in "Do", and so forth. Suppose I gave you a phone book and asked you to find all the people who are NOT named "Doe". Does the index help? Not really: skipping "Doe" would have been easy anyway, and you still need to go through the whole book. Again, this is a problem of low selectivity.
"_grouping._competition.competition_id": {
// Only include active competitions
"$nin": excludedCompetitions
}
Same argument, $nin on large amounts of data is bad.
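Since the question mentions that both exclusion lists are currently empty, one cheap mitigation is to leave the $nin clauses out of the filter whenever there is nothing to exclude. The following is only a sketch under that assumption; buildMatchFilter is a hypothetical helper name, not part of the original code, and the field names are taken from the question.

```javascript
// Hypothetical helper: builds the filter for getMatchesForDay and leaves out
// any $nin clause whose exclusion list is empty.
function buildMatchFilter(day, liveOnly, excludedAreas, excludedCompetitions) {
  const filter = { date_utc: day, score_coverage: true };

  // Only add the exclusion clauses when there is something to exclude.
  if (Array.isArray(excludedAreas) && excludedAreas.length > 0) {
    filter["_grouping._area.area_id"] = { $nin: excludedAreas };
  }
  if (Array.isArray(excludedCompetitions) && excludedCompetitions.length > 0) {
    filter["_grouping._competition.competition_id"] = { $nin: excludedCompetitions };
  }
  if (liveOnly) {
    filter.status = "Playing";
  }
  return filter;
}

const exampleFilter = buildMatchFilter("2015-03-15", false, [], []);
console.log(JSON.stringify(exampleFilter));
// prints {"date_utc":"2015-03-15","score_coverage":true}
```

The result can be passed to this.find() exactly like the original filter object. It does not make $nin any cheaper when the lists are non-empty.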
Now, the sorting:
this.find(filter).sort({
"_grouping._area.name": 1, //Sort by asc
"_grouping._competition.competition_id": 1, //Sort by asc
"date_iso": 1, //Sort by asc
}).select(include).exec(doneCallback);
Sorting is a relatively expensive operation, so you'll want to ensure your indexes follow the rule below, because then the data is already sorted in the index:
equality criteria -- sort criteria -- range criteria
But now you have turned around the order of date-area-competitionId used in the equality and range criteria to area-competitionId-date for sorting.
Solving this requires understanding of the problem domain. I suggest you try to rearrange the data structure based on query selectivity / locality concerns. Queries should be simple.
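As a starting point, here is a minimal sketch of an index definition that puts the equality field first and then repeats the sort fields in the same order as the .sort() call. It assumes the $nin clauses and score_coverage are left out of the index, as suggested above. buildIndexSpec is a hypothetical helper; whether the resulting index actually helps should be verified with .explain() on real data.

```javascript
// Hypothetical helper: equality fields first, then the sort fields in the
// same order and direction as the .sort() call.
function buildIndexSpec(equalityFields, sortSpec) {
  const spec = {};
  for (const field of equalityFields) {
    spec[field] = 1; // for a pure equality match the direction does not matter here
  }
  for (const [field, direction] of Object.entries(sortSpec)) {
    if (!(field in spec)) {
      spec[field] = direction;
    }
  }
  return spec;
}

// Sort fields copied from the query in the question.
const matchSort = {
  "_grouping._area.name": 1,
  "_grouping._competition.competition_id": 1,
  date_iso: 1
};

const matchIndexSpec = buildIndexSpec(["date_utc"], matchSort);
console.log(Object.keys(matchIndexSpec).join(", "));
// With mongoose this would be registered as: Match.index(matchIndexSpec);
```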

Related

Sort documents by a present field and a calculated value

How would I go about displaying the best and the worst reviews at the top of the page?
I think the user's "useful" and "notUseful" votes should have an effect on the result.
I have reviews, and if people click the useful or notUseful button, their id gets added to the appropriate array (useful or notUseful).
You can tell whether a review is positive or negative by its "overall" score, which runs from 1 through 5: 1 is the worst and 5 is the best.
I guess if someone gave a review with an overall score of 5 but got only one useful vote, while someone else gave a 4 and 100 people clicked "useful", the one with 100 votes should be shown as the best positive review?
I only want to show 2 reviews at the top of the page: the best and the worst. If there are ties on the overall score, the deciding factor should be usefulness. So if there are 2 reviews with the same overall score, and one has 5 usefuls and 10 notUsefuls (a net of -5) while the other has 5 usefuls and 4 notUsefuls (a net of 1), the second one would be shown to break the tie.
I'm hoping to do it with one mongoose query and not aggregation, but I think the answer will be aggregation.
I guess there could be a cutoff, e.g. scores greater than 3 are positive reviews and lower scores are negative reviews.
I use mongoose.
Thanks in advance for your help.
some sample data.
{
"_id" : ObjectId("5929f89a54aa92274c4e4677"),
"compId" : ObjectId("58d94c441eb9e52454932db6"),
"anonId" : ObjectId("5929f88154aa92274c4e4675"),
"overall" : 3,
"titleReview" : "53",
"reviewText" : "53",
"companyName" : "store1",
"replies" : [],
"version" : 2,
"notUseful" : [ObjectId("58d94c441eb9e52454932db6")],
"useful" : [],
"dateCreated" : ISODate("2017-05-27T22:07:22.207Z"),
"images" : [],
"__v" : 0
}
{
"_id" : ObjectId("5929f8dfa1435135fc5e904b"),
"compId" : ObjectId("58d94c441eb9e52454932db6"),
"anonId" : ObjectId("5929f8bab0bc8834f41e9cf8"),
"overall" : 3,
"titleReview" : "54",
"reviewText" : "54",
"companyName" : "store1",
"replies" : [],
"version" : 1,
"notUseful" : [ObjectId("5929f83bf371672714bb8d44"), ObjectId("5929f853f371672714bb8d46")],
"useful" : [],
"dateCreated" : ISODate("2017-05-27T22:08:31.516Z"),
"images" : [],
"__v" : 0
}
{
"_id" : ObjectId("5929f956a692e82398aaa2f2"),
"compId" : ObjectId("58d94c441eb9e52454932db6"),
"anonId" : ObjectId("5929f93da692e82398aaa2f0"),
"overall" : 3,
"titleReview" : "56",
"reviewText" : "56",
"companyName" : "store1",
"replies" : [],
"version" : 1,
"notUseful" : [],
"useful" : [],
"dateCreated" : ISODate("2017-05-27T22:10:30.608Z"),
"images" : [],
"__v" : 0
}
If I am reading your question correctly, you want a calculated difference of the "useful" and "notUseful" votes to be taken into account as well when sorting on the "overall" score of the documents.
The better option here is to include that calculation in your stored documents, but for completeness we will cover both options.
Aggregation
Without changes to your schema and other logic, then aggregation is indeed required to do that calculation. This is best presented as:
Model.aggregate([
{ "$addFields": {
"netUseful": {
"$subtract": [
{ "$size": "$useful" },
{ "$size": "$notUseful" }
]
}
}},
{ "$sort": { "overall": 1, "netUseful": -1 } }
],function(err, result) {
})
So you are basically getting the difference between the sizes of the two arrays, where more "useful" responses boost the ranking and more "notUseful" responses reduce it. Depending on the MongoDB version you have available, you use either $addFields with only the additional field or $project with all the fields you need to return.
The $sort is then performed on the combination of the "overall" score in ascending order as per your rules, and the new field of "netUseful" in descending order ranking "positive" to "negative".
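To make the ordering concrete, here is a small in-memory sketch of the same calculation applied to the three sample reviews from the question. It is only an illustration of what the pipeline computes, not a replacement for it; the strings "u1" to "u3" are placeholders standing in for the ObjectId values, and rankReviews is a hypothetical name.

```javascript
// The three sample reviews from the question, reduced to the relevant fields.
// "u1".."u3" are placeholders for the ObjectId values.
const sampleReviews = [
  { titleReview: "53", overall: 3, useful: [], notUseful: ["u1"] },
  { titleReview: "54", overall: 3, useful: [], notUseful: ["u2", "u3"] },
  { titleReview: "56", overall: 3, useful: [], notUseful: [] }
];

// Mirrors $addFields (netUseful = size(useful) - size(notUseful)) followed by
// $sort { overall: 1, netUseful: -1 }.
function rankReviews(docs) {
  return docs
    .map((doc) => ({ ...doc, netUseful: doc.useful.length - doc.notUseful.length }))
    .sort((a, b) => (a.overall - b.overall) || (b.netUseful - a.netUseful));
}

const ranked = rankReviews(sampleReviews);
console.log(ranked.map((r) => `${r.titleReview}:${r.netUseful}`).join(" "));
// prints "56:0 53:-1 54:-2"
```

All three samples share the same overall score, so the net vote count alone decides the order here.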
Re-Modelling
Foregoing aggregation altogether, you get a faster result from the plain query. But this of course means maintaining that "score" in the document as you add members to the array.
In the basic case, you use the $inc update operator along with $push to change the score.
So for a "useful" entry, you would do something like this:
Model.update(
{ "_id": docId, "useful": { "$ne": userId } },
{
"$push": { "useful": userId },
"$inc": { "netUseful": 1 }
},
function(err, status) {
}
)
And for a "notUseful" vote you do the opposite, "decrementing" by passing a negative value to $inc:
Model.update(
{ "_id": docId, "notUseful": { "$ne": userId } },
{
"$push": { "notUseful": userId },
"$inc": { "netUseful": -1 }
},
function(err, status) {
}
)
To cover all cases, including where a vote is "changed" from "useful" to "notUseful", you would expand on this logic and implement the appropriate reverse actions with $pull. But this should give the general idea.
N.B The reason we do not use the $addToSet operation here is because we want to make sure the user id is not present in the array when "incrementing" or "decrementing". Thus instead the $ne operator is used to test the value does not exist. If it does, then we do not attempt to modify the array or affect the "netUseful" value. The same applies to the reverse case of "removing" the user from those votes.
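For the "changed vote" case, a sketch of the filter and update documents might look like the following. It assumes the array names from the sample documents (useful, notUseful) and the stored netUseful counter described above; buildVoteSwitch is a hypothetical helper. The counter moves by 2 because one vote is removed and the opposite vote is added in the same update.

```javascript
const VOTE_FIELDS = ["useful", "notUseful"];

// Hypothetical helper: builds the filter and update for a user who moves a
// vote from one array to the other.
function buildVoteSwitch(docId, userId, from, to) {
  if (!VOTE_FIELDS.includes(from) || !VOTE_FIELDS.includes(to) || from === to) {
    throw new Error("from and to must be two different vote fields");
  }
  // Removing one vote and adding the opposite one shifts the net score by 2.
  const delta = to === "useful" ? 2 : -2;
  return {
    // Only match when the user is in the "from" array and not yet in "to".
    filter: { _id: docId, [from]: userId, [to]: { $ne: userId } },
    update: {
      $pull: { [from]: userId },
      $push: { [to]: userId },
      $inc: { netUseful: delta }
    }
  };
}

const op = buildVoteSwitch("docId-placeholder", "userId-placeholder", "useful", "notUseful");
console.log(JSON.stringify(op.update));
// With mongoose, as in the calls above: Model.update(op.filter, op.update, callback)
```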
Since the calculation is always maintained with each update, you simply perform a query with a standard .sort():
Model.find().sort({ "overall": 1, "netUseful": -1 }).exec(function(err,results) {
})
So by moving the "cost" into the maintenance of the "votes", you remove the overhead of running the aggregation later. For my money, where this is a regular operation and the "sort" does not rely on other run-time parameters which force the calculation to be dynamic, then you use the stored result instead.

Conditional Projection if element exists in Array in mongodb

Is there a direct way to project a new field if a value matches one in a huge sub-array? I know I can use $elemMatch or $ in the $match condition, but that would not let me get the rest of the non-matching values (users in my case).
Basically I want to list all type 1 items and show all the users while highlighting the subscribed user. The reason I want to do this in MongoDB is to avoid iterating over several thousand users for every item. In fact, that is part 2 of my question: can I limit the number of entries of the users array that are returned? I just need around 10 array values, not thousands.
The collection structure is
{
name: "Coke",
type: 2,
users:[{user: 13, type:1},{user: 2, type:2}]
},
{
name: "Adidas",
type: 1,
users:[{user:31, type:3},{user: 51, type:1}]
},
{
name: "Nike",
type: 1,
users:[{user:21, type:3},{user: 31, type:1}]
}
Total documents are 200,000+ and growing...
Every document has 10,000~50,000 users..
expected return
{
isUser: true,
name: "Adidas",
type: 1,
users:[{user:31, type:3},{user: 51, type:1}]
},
{
isUser: false,
name: "Nike",
type: 1,
users:[{user:21, type:3},{user: 31, type:1}]
}
and i've been trying this
.aggregate([
{$match:{type:1}},
{$project:
{
isUser:{$elemMatch:["users.user",51]},
users: 1,
type:1,
name: 1
}
}
])
This fails; I get the error "Maximum Stack size exceeded". I've tried a lot of combinations and none seem to work. I really want to avoid making multiple calls to MongoDB. Can this be done in a single call?
I've been told to use unwind, but I am a bit worried that it might lead to memory issues.
If I were using MySQL, a simple subquery would have done the job... I hope I am overlooking a similarly simple solution in MongoDB.
Evaluate the condition for each array element and reduce the results to a single flag by combining three operators. $map applies the conditional expression to each item in the users array and returns an array of the results. $ifNull acts as a safety net: it returns the $map result if that evaluates to a non-null value, and the fallback [false] otherwise (for example, when a document has no users array). $anyElementTrue then evaluates that array as a set and returns true if any of the elements is true and false otherwise, which gives the isUser field for each document:
db.collection.aggregate([
{ "$match": { "type": 1} },
{
"$project": {
"name": 1, "type": 1,
"isUser": {
"$anyElementTrue": [
{
'$ifNull': [
{
"$map": {
"input": "$users",
"as": "el",
"in": { "$eq": [ "$$el.user",51] }
}
},
[false]
]
}
]
}
}
}
])
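For the second part of the question (returning only around 10 users per document), one option is to add a $slice expression to the same $project stage. This is a sketch that assumes a MongoDB version where $slice is available as an aggregation expression (3.2 or later); buildIsUserPipeline and its parameters are hypothetical names.

```javascript
// Hypothetical helper: same isUser logic as above, plus a truncated users array.
function buildIsUserPipeline(type, userId, maxUsers) {
  return [
    { $match: { type: type } },
    {
      $project: {
        name: 1,
        type: 1,
        // Return only the first maxUsers entries of the array.
        users: { $slice: ["$users", maxUsers] },
        // isUser is still computed against the full, untruncated array.
        isUser: {
          $anyElementTrue: [
            {
              $ifNull: [
                {
                  $map: {
                    input: "$users",
                    as: "el",
                    in: { $eq: ["$$el.user", userId] }
                  }
                },
                [false]
              ]
            }
          ]
        }
      }
    }
  ];
}

const isUserPipeline = buildIsUserPipeline(1, 51, 10);
console.log(JSON.stringify(isUserPipeline[1].$project.users));
// Usage in the shell or a driver: db.collection.aggregate(isUserPipeline)
```

Note that the matched user is not guaranteed to be among the entries kept by $slice; the isUser flag only says that the user appears somewhere in the full array.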

Sorting and placing matched values on top

I am using MongoDB and Node.js to display a record set in a page. I have got as far as displaying them on the page alphabetically, but I would like to display one row (the "default" row) at the top, and all the others alphabetically beneath it.
I know, I know, Mongo is definitely not SQL, but in SQL I would have done something like this:
SELECT *
FROM themes
ORDER BY name != "Default", name ASC;
or perhaps even
SELECT * FROM themes WHERE name = "Default"
UNION
SELECT * FROM themes WHERE name != "Default" ORDER BY name ASC;
I have tried a few variations of Mongo's sorting options, such as
"$orderby": {'name': {'$eq': 'Default'}, 'name': 1}}
but without any luck so far. I have been searching a lot for approaches to this problem but I haven't found anything. I am new to Mongo but perhaps I'm going about this all wrong.
My basic code at the moment:
var db = req.db;
var collection = db.get('themes');
collection.find({"$query": {}, "$orderby": {'name': 1}}, function(e, results) {
res.render('themes-saved', {
title: 'Themes',
section: 'themes',
page: 'saved',
themes: results
});
});
You cannot do that in MongoDB, as sorting must be on a specific value already present in a field of your document. What you "can" do is $project a "weighting" to the record(s) matching your condition. As in:
collection.aggregate(
[
{ "$project": {
"each": 1,
"field": 1,
"youWant": 1,
"name": 1,
"weight": {
"$cond": [
{ "$eq": [ "$name", "Default" ] },
10,
0
]
}
}},
{ "$sort": { "weight": -1, "name": 1 } }
],
function(err,results) {
}
);
So you logically inspect the field you want to match a value in (or apply other logic) and then assign a higher value to "weight" for the documents that match, and a lower score or 0 to those that do not.
You then $sort on that "weighting" first (descending from highest in this case), so that those documents are listed before others with a lower weighting.
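The same weighting idea can be illustrated in plain JavaScript on a small array. This is only a sketch of what the $project and $sort stages compute; the theme names other than "Default" are made up, and sortWithPinned is a hypothetical helper.

```javascript
// Made-up theme names for illustration; only "Default" comes from the question.
const sampleThemes = [{ name: "Ocean" }, { name: "Default" }, { name: "Autumn" }];

function sortWithPinned(docs, pinnedName) {
  // Mirrors the $cond: 10 for the pinned name, 0 for everything else.
  const weight = (doc) => (doc.name === pinnedName ? 10 : 0);
  // Plain code-unit string comparison, ascending.
  const byName = (a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0);
  // weight descending first, then name ascending, like { weight: -1, name: 1 }.
  return [...docs].sort((a, b) => (weight(b) - weight(a)) || byName(a, b));
}

console.log(sortWithPinned(sampleThemes, "Default").map((t) => t.name).join(", "));
// prints "Default, Autumn, Ocean"
```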

Mongo query getting totals

If I had a schema that looked something like this:
var person = new Schema({
active: {type: Boolean},
otherSetting: {type: Boolean}
});
Would it be possible, with just one query, to get the total count of all people, the total active, the total inactive, as well as the totals for people with otherSetting set to true and otherSetting set to false? Or would otherSetting and active have to be split into two queries?
I've been playing around with the aggregate framework on this problem and although this seems like a simple problem, I can't seem to do it with just one query.
Is it even possible? Thanks for any help.
The aggregation framework has logical operators such as $cond that work well with your boolean condition here:
db.collection.aggregate([
{ "$group": {
"_id": null,
"active": { "$sum": { "$cond": [ "$active", 1, 0 ] } },
"inActive": { "$sum": { "$cond": [ "$active", 0, 1 ] } },
"total": { "$sum": 1 }
}}
])
The $cond operator is a "ternary" operator ( if/then/else ) that allows the evaluation of a logical condition to return the true ( then ) or false ( else ) values.
The "boolean" is evaluated as true/false in the first argument to $cond which passes the appropriate value to $sum in order to get the conditional totals.
Everything works within a single $group pipeline stage with a grouping key _id of null since you want to add up the whole collection. If grouping on the value of another field then replace that null with the field you want.
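The question also asks about otherSetting, which fits into the same $group stage. Below is a minimal sketch that generates a pair of counters for each boolean field; buildBooleanTotalsStage and the output names (activeTrue, activeFalse, and so on) are hypothetical choices, not anything prescribed by MongoDB.

```javascript
// Hypothetical helper: one $group stage with a true/false counter per field.
function buildBooleanTotalsStage(fields) {
  const group = { _id: null, total: { $sum: 1 } };
  for (const field of fields) {
    // $cond treats a missing or null field as false, so such documents
    // are counted in the "False" total.
    group[field + "True"] = { $sum: { $cond: ["$" + field, 1, 0] } };
    group[field + "False"] = { $sum: { $cond: ["$" + field, 0, 1] } };
  }
  return { $group: group };
}

const totalsPipeline = [buildBooleanTotalsStage(["active", "otherSetting"])];
console.log(Object.keys(totalsPipeline[0].$group).join(", "));
// prints "_id, total, activeTrue, activeFalse, otherSettingTrue, otherSettingFalse"
// Usage: db.collection.aggregate(totalsPipeline)
```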

ElasticSearch -- boosting relevance based on field value

Need to find a way in ElasticSearch to boost the relevance of a document based on a particular value of a field. Specifically, there is a special field in all my documents where the higher the field value is, the more relevant the doc that contains it should be, regardless of the search.
Consider the following document structure:
{
"_all" : {"enabled" : "true"},
"properties" : {
"_id": {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
"first_name": {"type" : "string", "store" : "yes", "index" : "yes"},
"last_name": {"type" : "string", "store" : "yes", "index" : "yes"},
"boosting_field": {"type" : "integer", "store" : "yes", "index" : "yes"}
}
}
I'd like documents with a higher boosting_field value to be inherently more relevant than those with a lower boosting_field value. This is just a starting point -- the matching between the query and the other fields will also be taken into account in determining the final relevance score of each doc in the search. But, all else being equal, the higher the boosting field, the more relevant the document.
Anyone have an idea on how to do this?
Thanks a lot!
You can boost either at index time or at query time. I usually prefer query-time boosting even though it makes queries a little slower; otherwise I'd need to reindex every time I want to change my boosting factors, which usually need fine-tuning and need to be pretty flexible.
There are different ways to apply query time boosting using the elasticsearch query DSL:
Boosting Query
Custom Filters Score Query
Custom Boost Factor Query
Custom Score Query
The first three queries are useful if you want to give a specific boost to the documents which match specific queries or filters. For example, if you want to boost only the documents published during the last month. You could use this approach with your boosting_field but you'd need to manually define some boosting_field intervals and give them a different boost, which isn't that great.
The best solution would be to use a Custom Score Query, which allows you to make a query and customize its score using a script. It's quite powerful, with the script you can directly modify the score itself. First of all I'd scale the boosting_field values to a value from 0 to 1 for example, so that your final score doesn't become a big number. In order to do that you need to predict what are more or less the minimum and the maximum values that the field can contain. Let's say minimum 0 and maximum 100000 for instance. If you scale the boosting_field value to a number between 0 and 1, then you can add the result to the actual score like this:
{
"query" : {
"custom_score" : {
"query" : {
"match_all" : {}
},
"script" : "_score + (1 * doc.boosting_field.doubleValue / 100000)"
}
}
}
You can also consider using the boosting_field as a boost factor (_score * rather than _score +), but then you'd need to scale it to an interval with a minimum value of 1 (just add 1).
You can even tune the result to change its importance by adding a weight to the value that you use to influence the score. You will need this even more if you combine multiple boosting factors and want to give each a different weight.
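The arithmetic behind the two variants can be sketched in plain JavaScript. This only illustrates the formulas described above and is not Elasticsearch scripting code; the function names, the weight parameter and the clamping to the 0..1 range are assumptions added for the example.

```javascript
// Scale the raw field value into the 0..1 range, clamping values that fall
// outside the predicted minimum (0) and maximum (maxValue).
function scaleBoost(fieldValue, maxValue) {
  return Math.min(Math.max(fieldValue / maxValue, 0), 1);
}

// Additive variant: _score + weight * scaled value.
function additiveBoost(score, fieldValue, maxValue, weight = 1) {
  return score + weight * scaleBoost(fieldValue, maxValue);
}

// Multiplicative variant: the factor is shifted by 1 so it never drops below 1.
function multiplicativeBoost(score, fieldValue, maxValue, weight = 1) {
  return score * (1 + weight * scaleBoost(fieldValue, maxValue));
}

console.log(additiveBoost(1.5, 50000, 100000));       // prints 2
console.log(multiplicativeBoost(1.5, 50000, 100000)); // prints 2.25
```

With a maximum of 100000, a document whose boosting_field is 50000 gains 0.5 in the additive form and is multiplied by 1.5 in the multiplicative form.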
With a recent version of Elasticsearch (version 1.3+) you'll want to use "function score queries":
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
A scored query_string search looks like this:
{
"query": {
"function_score": {
"query": { "query_string": { "query": "my search terms" } },
"functions": [{ "field_value_factor": { "field": "my_boost" } }]
}
}
}
"my_boost" is a numeric field in your search index that contains the boost factor for individual documents. May look like this:
{ "my_boost": { "type": "float", "index": "not_analyzed" } }
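If the raw field values vary widely, field_value_factor also accepts a modifier and a missing value, and function_score accepts a boost_mode. The sketch below builds such a request body in JavaScript; buildBoostedSearch is a hypothetical helper, and the particular choices (log1p, missing: 1, multiply) are assumptions to tune, and should be checked against the documentation for your Elasticsearch version.

```javascript
// Hypothetical helper: builds a function_score request body.
function buildBoostedSearch(searchTerms, boostField) {
  return {
    query: {
      function_score: {
        query: { query_string: { query: searchTerms } },
        functions: [
          {
            field_value_factor: {
              field: boostField,
              modifier: "log1p", // dampens very large field values
              missing: 1         // value used when a document lacks the field
            }
          }
        ],
        boost_mode: "multiply"   // combine the function result with _score by multiplying
      }
    }
  };
}

const searchBody = buildBoostedSearch("my search terms", "my_boost");
console.log(JSON.stringify(searchBody, null, 2));
```

With "multiply", a field value of 0 gives log1p(0) = 0 and therefore a final score of 0; a boost_mode of "sum" avoids that if zero values are expected.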
If you want to avoid doing the boosting inside each query, you might consider adding it directly to your mapping with "boost": factor.
So your mapping may then look like this:
{
"_all" : {"enabled" : "true"},
"properties" : {
"_id": {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
"first_name": {"type" : "string", "store" : "yes", "index" : "yes"},
"last_name": {"type" : "string", "store" : "yes", "index" : "yes"},
"boosting_field": {"type" : "integer", "store" : "yes", "index" : "yes", "boost" : 10.0}
}
}
If you are using Nest, you should use this syntax:
.Query(q => q
.Bool(b => b
.Should(s => s
.FunctionScore(fs => fs
.Functions(fn => fn
.FieldValueFactor(fvf => fvf
.Field(f => f.Significance)
.Weight(2)
.Missing(1)
))))
.Must(m => m
.Match(ma => ma
.Field(f => f.MySearchData)
.Query(query)
))))

Resources