Does a MongoDB/Mongoose index make queries faster or slow them down? - node.js

I have an article model like this:
var ArticleSchema = new Schema({
type: String
,title: String
,content: String
,hashtags: [String]
,comments: [{
type: Schema.ObjectId
,ref: 'Comment'
}]
,replies: [{
type: Schema.ObjectId
,ref: 'Reply'
}]
, status: String
,statusMeta: {
createdBy: {
type: Schema.ObjectId
,ref: 'User'
}
,createdDate: Date
, updatedBy: {
type: Schema.ObjectId
,ref: 'User'
}
,updatedDate: Date
,deletedBy: {
type: Schema.ObjectId,
ref: 'User'
}
,deletedDate: Date
,undeletedBy: {
type: Schema.ObjectId,
ref: 'User'
}
,undeletedDate: Date
,bannedBy: {
type: Schema.ObjectId,
ref: 'User'
}
,bannedDate: Date
,unbannedBy: {
type: Schema.ObjectId,
ref: 'User'
}
,unbannedDate: Date
}
}, {minimize: false})
When a user creates or modifies an article, I generate the hashtags:
ArticleSchema.pre('save', true, function(next, done) {
var self = this
if (self.isModified('content')) {
self.hashtags = helper.listHashtagsInText(self.content)
}
done()
return next()
})
For example, if a user writes "Hi, #greeting, i love #friday", I will store ['greeting', 'friday'] in the hashtags list.
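The helper.listHashtagsInText function is not shown in the question. A minimal sketch of what such a helper might look like, assuming a hashtag is a '#' followed by letters, digits or underscores (the function body here is a hypothetical stand-in, not the asker's code):

```javascript
// Hypothetical stand-in for helper.listHashtagsInText (the original is not shown).
// Returns the unique hashtags of a text, without the leading '#',
// in order of first appearance.
function listHashtagsInText(text) {
  const seen = new Set();
  // Assumption: a hashtag is '#' followed by letters, digits or underscores.
  for (const match of String(text).matchAll(/#(\w+)/g)) {
    seen.add(match[1]);
  }
  return Array.from(seen);
}

console.log(listHashtagsInText('Hi, #greeting, i love #friday'));
// prints [ 'greeting', 'friday' ]
```

Using a Set keeps a tag that is repeated in the text from being stored twice.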
I am thinking about creating an index on hashtags to make queries on hashtags faster. But in the Mongoose manual, I found this:
When your application starts up, Mongoose automatically calls
ensureIndex for each defined index in your schema. Mongoose will call
ensureIndex for each index sequentially, and emit an 'index' event on
the model when all the ensureIndex calls succeeded or when there was
an error. While nice for development, it is recommended this behavior
be disabled in production since index creation can cause a significant
performance impact. Disable the behavior by setting the autoIndex
option of your schema to false.
http://mongoosejs.com/docs/guide.html
So does indexing make MongoDB/Mongoose faster or slower?
Also, even if I create an index like
hashtags: { type: [String], index: true }
how can I make use of the index in my query? Or will normal queries like this one just magically become faster:
Article.find({hashtags: 'friday'})

You are reading it wrong
You are misreading the intent of the quoted block as to what .ensureIndex() (now deprecated, but still called by mongoose code) actually does in this context.
In mongoose, you define an index at either the schema or the model level, as appropriate to your design. What mongoose "automatically" does for you is that on connection it inspects each registered model and then calls the appropriate .ensureIndex() methods for the index definitions provided.
What does this actually do?
Well, in most cases, that is, whenever you have already started your application before and .ensureIndex() has already run, it does Absolutely Nothing. That is a bit of an overstatement, but it more or less rings true.
Because the index definition has already been created on the server collection, a subsequent call does not do anything. I.e., it does not drop the index and "re-create" it. So the real cost is basically nothing once the index itself has been created.
Creating indexes
So since mongoose is just a layer on top of the standard API, the createIndex() method contains all the details of what is happening.
There are some details to consider here, such as that an index build can happen in the "background", and while this is less intrusive to your application it does come at its own cost. Notably, an index built in the "background" will be larger than one built in the foreground, although a foreground build blocks other operations.
Also all indexes come at a cost, notably in terms of disk usage as well as an additional cost of writing the additional information outside of the collection data itself.
The advantage of an index is that it is much faster to "search" for values contained within an index than to seek through the whole collection and match the possible conditions.
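To make the trade-off concrete, here is a toy model in plain JavaScript. It is only an illustration of the idea: the sample articles and function names are made up, and MongoDB itself uses B-tree indexes on the server rather than a Map. A query such as Article.find({hashtags: 'friday'}) does not need any special syntax to benefit; the lookup simply stops being a full scan.

```javascript
// Toy model of an index on `hashtags`: tag -> positions of matching articles.
// Sample data is made up for illustration.
const articles = [
  { title: 'A', hashtags: ['greeting', 'friday'] },
  { title: 'B', hashtags: ['monday'] },
  { title: 'C', hashtags: ['friday'] },
];

// Without an index: every document has to be examined.
function findByScan(docs, tag) {
  return docs.filter((doc) => doc.hashtags.includes(tag));
}

// Building the index costs time and memory, and it has to be
// kept up to date on every write.
function buildHashtagIndex(docs) {
  const index = new Map();
  docs.forEach((doc, position) => {
    for (const tag of doc.hashtags) {
      if (!index.has(tag)) index.set(tag, []);
      index.get(tag).push(position);
    }
  });
  return index;
}

// With an index: jump straight to the matching documents.
function findByIndex(docs, index, tag) {
  return (index.get(tag) || []).map((position) => docs[position]);
}

const hashtagIndex = buildHashtagIndex(articles);
console.log(findByIndex(articles, hashtagIndex, 'friday').map((a) => a.title));
// prints [ 'A', 'C' ]
```

Both functions return the same documents; the difference is only in how many documents have to be looked at to find them.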
These are the basic "trade-offs" associated with indexes.
Deployment Pattern
Back to the quoted block from the documentation, there is a real intent behind this advice.
It is typical in deployment patterns and particularly with data migrations to do things in this order:
Populate data to relevant collections/tables
Enable indexes on the collection/table data relevant to your needs
This is because there is a cost involved with index creation. As mentioned earlier, it is desirable to get the optimal size from the index build, and to avoid the overhead of writing an index entry for every document insertion when you are doing this "load" in bulk.
So that is what indexes are for, those are the costs and benefits, and that explains the message in the mongoose documentation.
In general though, I suggest reading up on Database Indexes for what they are and what they do. Think of walking into a library to find a book. There is a card index there at the entrance. Do you walk around the library to find the book you want? Or do you look it up in the card index to find where it is? That index took someone time to create and also keep it updated, but it saves "you" the time of walking around the whole library just so you can find your book.

Related

Is it bad practice to set mongodb object id to a 'String' instead of 'Schema.Types.ObjectId'?

I want to know if this can affect performance or other important matters in terms of functionality especially when finding documents in a mongodb collection
I have done this
var ComputerArticleSchema = mongoose.Schema({
_id: {
type: String,
required: true
},
It's commonly done like this:
_id: {
type: Schema.Types.ObjectId,
required: true
},
It is not so much about performance: when you have several models with associations between them, you need a way to link them, so ObjectId references are necessary. Also, you don't have to define _id manually like you did; MongoDB creates the _id automatically.
_id is created automatically!
If you are referencing other table use 'table_id' (or some other key name) and give type as Schema.Types.ObjectId or Schema.ObjectId.
const ObjectId = Schema.ObjectId;
user_id: { type: ObjectId, ref: 'User' }
Storing strings rather than ObjectIds does hurt performance.
ObjectIds are smaller than the equivalent strings (they are a 12 byte binary value rather than a 24 character UTF-8 string value), so they take up less space in memory.
Mongo is really fast when indexes & documents are in the working set (i.e. in memory) so by lowering the data footprint, you're able to make sure that more documents stay in memory. This is especially important because the id fields that you're talking about are often included in indexes.
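The size difference can be checked in Node.js without a database. This is only a sketch of the raw value sizes; the id below is a made-up example value.

```javascript
// A 24-character hex string, i.e. the string form of an ObjectId
// (the value itself is a made-up example).
const idAsString = '507f1f77bcf86cd799439011';

// Stored as a string, each hex character takes one byte in UTF-8.
const stringBytes = Buffer.byteLength(idAsString, 'utf8');

// Stored as binary, two hex characters pack into a single byte.
const binaryBytes = Buffer.from(idAsString, 'hex').length;

console.log({ stringBytes, binaryBytes });
// prints { stringBytes: 24, binaryBytes: 12 }
```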

Mongoose: Difference between referencing "Schema.ObjectId" instead of directly using the schema name?

Suppose I have the following MessageSchema model:
var MessageSchema = new Schema({
userId: String,
username: String,
message: String,
date: Date
});
mongoose.model('Message', MessageSchema)
Can someone tell me the difference between the following two implementations of the Meetings model? Thanks.
var Meetings = new Schema({
_id: ObjectId,
name: String,
messages: [MessageSchema],
...
});
var Meetings2 = new Schema({
_id: ObjectId,
name: String,
messages: [{type: Schema.ObjectId, ref: 'Message'}],
...
});
The main difference is that the Meetings model embeds the MessageSchema (denormalization) whilst the Meetings2 model references it (normalization). The choice boils down to your model design, and that depends mostly on how you query and update your data. In general, you would want to use an embedded schema design if the subdocuments are small and the data does not change frequently. Also, if the Message data grows by only a small amount, consider denormalizing your schema. The embedded approach allows optimized reads and can therefore be faster, since you execute only a single query because all the data resides in the same document.
On the other hand, consider referencing if your Message documents are very large, so that they are kept in a separate collection that you can then reference. Another factor in favor of referencing is a document that grows by a large amount. Another important consideration is how often the data changes (volatility) versus how often it is read. If it is updated regularly, then referencing is a good approach, because it keeps writes fast.
You can use a hybrid of embedding and referencing i.e. create an array of subdocuments with the frequently accessed data but with a reference to the actual document for more information.
The general rule of thumb is that if your application's query pattern is well-known and data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways or you are unable to anticipate the query patterns, a more normalized, document-referencing model is more appropriate.
The messages field of Meetings contains an array of Message objects, while the messages field of Meetings2 contains an array of Message ids.
var Meetings2 = new Schema({
...
messages: [{type: Schema.ObjectId, ref: 'Message'}],
...
});
can be written as
var Meetings2 = new Schema({
...
messages: [Schema.ObjectId],
...
});
The ref is just a helper option in mongoose: it tells populate() which model to use, making it easier to populate the messages.
So in summary: in Meetings you embed the messages in an array, while in Meetings2 you reference the messages.
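The difference in document shape can be sketched with plain objects. This is an in-memory illustration only: the ids, sample messages and the populateMessages function are made up, and the function only roughly mimics what populate does with an extra query.

```javascript
// In-memory stand-in for the messages collection, keyed by id (ids are made up).
const messagesCollection = new Map([
  ['m1', { _id: 'm1', username: 'ann', message: 'hello' }],
  ['m2', { _id: 'm2', username: 'bob', message: 'hi' }],
]);

// Meetings style: the message documents live inside the meeting document.
const embeddedMeeting = {
  name: 'standup',
  messages: [
    { username: 'ann', message: 'hello' },
    { username: 'bob', message: 'hi' },
  ],
};

// Meetings2 style: the meeting document stores only the ids.
const referencedMeeting = { name: 'standup', messages: ['m1', 'm2'] };

// Rough equivalent of populating: a second lookup that swaps ids for documents.
function populateMessages(meeting, collection) {
  return {
    ...meeting,
    messages: meeting.messages.map((id) => collection.get(id)),
  };
}

const populated = populateMessages(referencedMeeting, messagesCollection);
console.log(populated.messages.map((m) => m.message));
// prints [ 'hello', 'hi' ]
```

Reading embeddedMeeting needs one lookup; reading the referenced version needs the meeting plus a second lookup for its messages.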

How to calculate Rating in my MongoDB design

I'm creating a system where users can write a review about an item and rate it from 0-5. I'm using MongoDB for this. My problem is finding the best way to calculate the total rating in the product schema. I don't think querying all comments to get their count and dividing the total rating by it is a good solution. Here is my schema. I appreciate any advice:
Comments:
var commentSchema = new Schema({
Rating : { type: Number, default:0 },
Helpful : { type: Number, default:0 },
User :{
type: Schema.ObjectId,
ref: 'users'
},
Content: String,
});
Here is my Item schema:
var productSchema = new Schema({
//id is barcode
_id : String,
Rating : { type: Number, default:0 },
Comments :[
{
type: Schema.ObjectId,
ref: 'comments'
}
],
});
EDIT: HERE is the solution I got from another topic : calculating average in Mongoose
You can get the total using the aggregation framework. First you use the $unwind operator to turn the comments into a document stream:
{ $unwind: "$Comments" }
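The result is that each product document is turned into one product document per entry in its Comments array. That comment entry becomes a single object under the field Comments; all other fields are taken from the originating product document. Note that this relies on the comments being embedded in the product document; if Comments holds only ObjectId references, as in the schema above, the comment documents would first have to be joined in, or the aggregation run on the comments collection instead.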
Then you use $group to rejoin the documents for each product by their _id, while you use the $avg operator to calculate the average of the rating-field:
{ $group: {
_id: "$_id",
average: { $avg: "$Comments.Rating" }
} }
Putting those two steps into an aggregation pipeline calculates the average rating for every product in your collection. You might want to narrow it down to one or a small subset of products, depending on what the user requested right now. To do this, prepend the pipeline with a $match step. The $match object works just like the one you pass to find().
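What the two stages compute can be mimicked in plain JavaScript. This is a sketch of the semantics only, not how the server executes the pipeline; it assumes the comments are embedded objects with a Rating field, and the sample products are made up.

```javascript
// Made-up sample data with embedded comments.
const products = [
  { _id: 'p1', Comments: [{ Rating: 4 }, { Rating: 2 }] },
  { _id: 'p2', Comments: [{ Rating: 5 }] },
];

// Mimics $unwind: one output document per entry of the Comments array.
function unwindComments(docs) {
  return docs.flatMap((doc) =>
    doc.Comments.map((comment) => ({ ...doc, Comments: comment }))
  );
}

// Mimics $group with $avg: regroup by _id and average Comments.Rating.
function groupAverage(unwound) {
  const totals = new Map();
  for (const doc of unwound) {
    const entry = totals.get(doc._id) || { sum: 0, count: 0 };
    entry.sum += doc.Comments.Rating;
    entry.count += 1;
    totals.set(doc._id, entry);
  }
  return Array.from(totals, ([id, entry]) => ({
    _id: id,
    average: entry.sum / entry.count,
  }));
}

console.log(groupAverage(unwindComments(products)));
// prints [ { _id: 'p1', average: 3 }, { _id: 'p2', average: 5 } ]
```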
The underlying question that it would be useful to understand is why you don't think that finding all of the ratings, summing them up, and dividing by the total number is a useful approach. Understanding the underlying reason would help drive a better solution.
Based on the comments below, it sounds like your main concern is performance and the need to run map-reduce (or the aggregation framework) each time a user wants to see total ratings.
This person addressed a similar issue here: http://markembling.info/2010/11/using-map-reduce-in-a-mongodb-app
The solution they identified was to separate out the execution of the map-reduce function from the need in the view to see the total value. In this case, the optimal solution would be to run the map-reduce periodically and store the results in another collection, and have the average rating based on the collection that stores the averages, rather than doing the calculation in real-time each time.
As I mentioned in the previous version of this answer, you can improve performance further by limiting the map-reduce to ratings that were created or updated recently, or since the last map-reduce aggregation.
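A related way to avoid recomputing in real time, different from the periodic map-reduce described above, is to keep a running count and sum next to the stored average and update them whenever a rating is added. A minimal sketch, assuming two extra fields RatingCount and RatingSum on the product (these names are hypothetical and not part of the original schema):

```javascript
// Returns a copy of the product with one more rating folded into the average.
// RatingCount and RatingSum are hypothetical helper fields.
function addRating(product, rating) {
  const count = product.RatingCount + 1;
  const sum = product.RatingSum + rating;
  return { ...product, RatingCount: count, RatingSum: sum, Rating: sum / count };
}

let product = { _id: '0123456789', Rating: 0, RatingCount: 0, RatingSum: 0 };
product = addRating(product, 4);
product = addRating(product, 5);
console.log(product.Rating);
// prints 4.5
```

Reads then only need the stored Rating field; the cost moves to a small extra write whenever a review is saved.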

create mongodb document with subdocuments atomically?

I hope I'm just having a big brainfart moment. Here's my situation in a scraping scenario:
I want to be able to scrape across multiple machines and cores. Per site, I scrape several Front pages (e.g. for the site stackoverflow I'd have the fronts stackoverflow.com/questions/tagged/javascript and stackoverflow.com/questions/tagged/nodejs).
An article could appear on every Front. When I discover an article, I want to create an Article if the url is unknown. If it is known, I want to add a Front entry to article.discover if the Front is unknown, and otherwise insert my FrontDiscovery into the appropriate Front.
Here are my Schemas;
FrontDiscovery = new Schema({
_id :{ type:ObjectId, auto:true },
date :{ type: Date, default:Date.now},
dims :{ type: Object, default:null},
pos :{ type: Object, default:null}
});
Front = new Schema({
_id :{ type:ObjectId, auto:true },
url :{type:String}, //front
found :[ FrontDiscovery ]
});
Article = new Schema({
_id :{ type:ObjectId, auto:true },
url :{ type: String , index: { unique: true } },
site :{ type: String },
discover:[ Front]
});
The problem I think I will eventually run into is a race condition: two job-runners (in parallel) find the same (previously unknown) article and each create a new one. Yes, I have a unique index on it and could handle it that way, but that is quite inelegant imho.
But let's go further: when, for whatever reason, my 2 job-runners are scraping the same front at the same time, both notice that there is no entry for that Front yet, and both create a new one and add the FrontDiscovery, I'd end up with two entries for the same Front.
What are your strategies to circumvent such a situation? findByIdAndUpdate with upsert:true for each document separately? If so, how can I push something to the embedded document collection without overwriting everything else, while still creating the defaults if the document hasn't been created yet?
Thank you for any help in pointing me in the right direction! I really hope I'm having a massive brainfart..
Update with upsert=true can be used to perform an atomic "insert or update" (http://docs.mongodb.org/manual/core/update/#update-operations-with-the-upsert-flag).
For instance if we want to make sure a document in Front collection with specific url is inserted exactly once, we could run something like:
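db.Front.update(
  // Match on the url of the front.
  {url: 'http://example.com'},
  // Fields to set; with upsert the document is created if nothing matches.
  {$set: {url: 'http://example.com', found: true}},
  // Without this option, nothing is inserted when no document matches.
  {upsert: true}
)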
Operations on a single document in MongoDB are always atomic. If you make updates that span multiple documents, then no atomicity is guaranteed. In such cases, you can ask yourself: do I really need the operations to be atomic? If the answer is no, then you will probably find a way to work with potentially inconsistent data. If the answer is yes and you want to stick with MongoDB, check out the design pattern on Two Phase Commits.
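For the "push without overwriting" part of the question, one option is to combine the upsert with $push, so that a single atomic operation either appends to an existing document or creates it. The sketch below only builds the three arguments for such an update call; the function name is hypothetical, and it assumes Front documents live in their own collection keyed by url, as in the shell example above.

```javascript
// Builds the filter, update and options for an atomic "create or append".
// buildFrontDiscoveryUpsert is a hypothetical helper name.
function buildFrontDiscoveryUpsert(frontUrl, discovery) {
  return {
    // On insert, the url of the new document is taken from this equality match.
    filter: { url: frontUrl },
    // $push appends to `found` and leaves every other field untouched.
    update: { $push: { found: discovery } },
    // Create the document when no front with this url exists yet.
    options: { upsert: true },
  };
}

const spec = buildFrontDiscoveryUpsert('http://example.com', {
  date: new Date(0),
  dims: null,
  pos: null,
});
console.log(JSON.stringify(spec.update));
// The three parts would then be passed to an update call, e.g.
// db.Front.update(spec.filter, spec.update, spec.options)
```

Because the whole change happens on one document in one operation, two job-runners issuing it at the same time do not overwrite each other's discoveries. A unique index on url is still advisable, since two simultaneous upserts for a brand-new url can otherwise both insert.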

Mongoose: populate() / DBref or data duplication?

I have two collections:
Users
Uploads
Each upload has a User associated with it and I need to know their details when an Upload is viewed. Is it best practice to duplicate this data inside the Uploads record, or to use populate() to pull in these details from the Users collection referenced by _id?
OPTION 1
var UploadSchema = new Schema({
_id: { type: Schema.ObjectId },
_user: { type: Schema.ObjectId, ref: 'users'},
title: { type: String },
});
OPTION 2
var UploadSchema = new Schema({
_id: { type: Schema.ObjectId },
user: {
name: { type: String },
email: { type: String },
avatar: { type: String },
//...etc
},
title: { type: String },
});
With 'Option 2' if any of the data in the Users collection changes I will have to update this across all associated Upload records. With 'Option 1' on the other hand I can just chill out and let populate() ensure the latest User data is always shown.
Is the overhead of using populate() significant? What is the best practice in this common scenario?
If you need to query on your Users, keep users on their own. If you need to query on your uploads, keep uploads on their own.
Other questions you should ask yourself are: every time I need this data, do I need the embedded objects (and vice versa)? How many times will this data be updated? How many times will it be read?
Think about a friendship request:
Each time you need the request you need the user who made the request, so embed the request inside the user document.
You will be able to create an index on the embedded object too, and your search will be a single query, fast and consistent.
Just a link to my previous reply on a similar question:
Mongo DB relations between objects
I think this post will be right for you http://www.mongodb.org/display/DOCS/Schema+Design
Use Cases
Customer / Order / Order Line-Item
Orders should be a collection. customers a collection. line-items should be an array of line-items embedded in the order object.
Blogging system.
Posts should be a collection. post author might be a separate collection, or simply a field within posts if only an email address. comments should be embedded objects within a post for performance.
Schema Design Basics
Kyle Banker, 10gen
http://www.10gen.com/presentation/mongosf2011/schemabasics
Indexing & Query Optimization
Alvin Richards, Senior Director of Enterprise Engineering
http://www.10gen.com/presentation/mongosf-2011/mongodb-indexing-query-optimization
These 2 videos are the best on MongoDB I have ever seen, imho.
Populate() is just a query. So the overhead is whatever the query is, which is a find() on your model.
Also, best practice for MongoDB is to embed what you can. It will result in a faster query. It sounds like you'd be duplicating a ton of data though, which makes relations (linking) a good fit.
"Linking" is just putting an ObjectId in a field from another model.
Here is the Mongo Best Practices http://www.mongodb.org/display/DOCS/Schema+Design#SchemaDesign-SummaryofBestPractices
Linking/DBRefs http://www.mongodb.org/display/DOCS/Database+References#DatabaseReferences-SimpleDirect%2FManualLinking

Resources