I've been reading a lot of best practices and how I should embrace the _id. To be honest, I'm getting my kind of paranoid at the repercussions I might face if I don't do this for when I start scaling up my application.
Currently I have about 50k documents per database. It's only been a few months with heavy usage. I expect this to grow A LOT. I do a lot of .find() Mango Queries, not much indexing; and to be honest working off a relational style document structuring.
For example:
First get Project from ID.
Then do a find query that:
grabs all type:signature where project_id: X.
grabs all type:revisions where project_id: X.
The reason for this is I try VERY hard not to update documents. A lot of these documents are created offline, so doing a write once workflow is very important for me to avoid conflicts.
I'm currently at a point of no return as scheduling is getting pretty intense. If I want to change the way I'm doing things now is the best time before it gets too crazy.
I'd love to hear your thoughts about using the _id for data structuring and what people think.
Being able to make one call with a _all_docs grab like this sounds appealing to me:
{
"include_docs": true,
"startkey": "project:{ID}",
"endkey": "project:{ID}:\ufff0"
}
An example of how ONE type of my documents are set is like so:
Main Document
{
_id: {COUCH_GENERATED_1},
type: "project",
..
.
}
Signature Document
{
_id: {COUCH_GENERATED_2},
type: "signature",
project_id: {COUCH_GENERATED_1},
created_at: {UNIX_TIMESTAMP}
}
Change to Main Document
{
_id: {COUCH_GENERATED_3},
type: "revision",
project_id: {COUCH_GENERATED_1},
created_at: {UNIX_TIMESTAMP}
data: [{..}]
}
I was wondering whether I should do something like this:
Main Document: _id: project:{kuuid_1}
Signature Document: _id: project:{kuuid_1}:signature:{kuuid_2}
Change to Main Document: _id: project:{kuuid_1}:rev:{kuuid_3}
I'm just trying to set up my database in a way that isn't going to mess with me in the future. I know problems are going to come up but I'd like not to heavily change the structure if I can avoid it.
Another reason I am thinking of this is that I watch for _changes in my databases and being able to know what types are coming through without getting each document every time a document changes sound appealing also.
Setting up your database structure so that it makes data retrieval easier is good practice. It seems to me you have some options:
If there is a field called project_id in the documents of interest, you can create an index on project_id which would allow you to fetch all documents pertaining to a known project_id cheaply. see CouchDB Find
Create a MapReduce index keyed on project_id e.g if (doc.project_id) { emit(doc.project_id)}. The index that this produces would allow you to fetch documents by known project_id with judicious use of start_key& end_key when querying the view. see Introduction to views
As you say, packing more information into the _id field allows you to perform range queries on the _all_docs endpoint.
If you choose a key design of:
project{project_id}:signature{kuuid}
then the primary index of the database has all of a single project's documents grouped together. Putting the project_id before the ':' character is preparation for a forthcoming CouchDB feature called "partitioned databases", which groups logically related documents in their own partition, making it quicker and easier to perform queries on a single partition, in your case a project. This feature isn't ready yet but it's likely to have a {partition_key}:{document_key} format for the _id field, so there's no harm in getting your document _ids ready for it for when it lands (see CouchDB mailing list! In the meantime, a range query on _all_docs will work.
Related
I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straight forward, as described in Modeling document data. However, I also would like to separate the related documents into different collections (this decision is related to how the data are partitioned).
Edit 7/24/2017: In response to a comment wondering why I chose to use separate collections: The reasoning for a separate collections mainly comes down to partition keys and read/write priorities. Since a certain partition key is required to be present in ALL documents in the collection, it then makes sense to separate documents that the chosen partition key doesn't belong. After much weighing of options, the partition key that I settled on was one that would optimize write speeds and evenly distribute my data across shards - but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one to gazillion relationship between metadata and measurements, I chose to use a reference to the metadata in the measurements instead of embedding. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round-trip to the DB a very low concern.
Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, we may use a kind of path?
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata-id" : "../Metadata/metadata1",
...
}
Then, when I parse the data in my application/script I know what collection and document to query.
Finally, I assume there are other/better ways to go about this and I welcome your suggests (e.g. underscores, not slashes; use a symbol to represent a collection, like $Metadata; etc). Or, is my use of relations spanning collections a code smell?
Thank you!
Edit: To the downvoter, can you please explain your reasoning? Is my question uninformed, unclear, or not useful? Why?
You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary since you're billed at the per collection level. What you should be doing is picking a more generic partition key. Something like key or partitionKey. The tradeoff here is that you'll need to ensure in your client application that you populate this property on all of your documents (it may lead to a duplicated value, but ultimately that's okay). They you can continue to use the value of whatever you chose originally for your Measurements document and set something different for your Metadata documents.
I've written about this extensively in some other answers here and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that in many Cosmos examples they talk about picking a partitionKey like deviceId or postal code which implies that you're dealing with homogeneous documents.
Please refer to this question that I answered regarding homogeneous vs heterogeneous in documentdb. The biggest argument for this pattern is the new addition of Graph APIs in Cosmos which necessitate having many different types of entities in a single collection and supports exactly the use case you're describing minus the extra collections. Obviously when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key which is why you need to go generic.
What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind though, that using this type of "relations" will be rather slow, because you need to fetch all documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
Another possibility is to optimise your data for reading: you can embed the metadata document inside the other document. Your data will be duplicated, so if you update those documents, you will have to update them in both collections, but you'll probably write less often than you read (probably, if that's not the case, this setup would be worse).
Your documents would look like this:
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata" : {
"id": "metadata1",
...
},
...
}
I'm creating a web app using Mongoose/MongoDB to store information that will be voted on. I'll be storing usernames and IP addresses with the vote (so voters can update/modify their votes if desired).
Root Question: What's the best way to securely architecture voting in a Mongoose schema?
Currently, my schema looks like this (simplified):
var Thing = new Schema({
title: {
type: String
},
creator: {
type: String
},
options: [{
description: {
type: String
},
votes: [{
username: {
type: String
},
ip: {
type: String
}
}]
}]
});
mongoose.model('Thing', Thing);
While this makes querying the db for any given Thing super easy, it becomes more problematic for security for obvious reasons - I don't want to be returning out usernames and ip addresses to the browser.
The problem is, I'm not sure which is the best/least painful scenario for securely returning Thing data to the browser:
Loop through each option in Thing.options, then sub-loop through each vote in Thing.options[i].votes to find the vote cast by the user requesting the data, then delete all votes to get rid of other user data. This seems to be very resource intensive, but I couldn't find a way to use indexOf in subarrays (guidance welcome on this one), i.e. Thing.options.votes.indexOf(username) or something to that effect.
Store vote information in the already-existing User schema, then have to search through all users for vote data and stick it all together every time I want query a single Thing. This also seems inefficient/more resource intensive/more complicated than necessary.
Create a separate Vote schema that stores the data more conveniently, but then adds another database call (one for the Thing, one for the Vote).
This problem is somewhat compounded by the fact that there are different ways to vote, with this being the simplest.
Research...for posterity's sake:
This question addresses voting in databases, but for a relational db, not MongoDB/Mongoose.
This question addresses Mongoose/Node.js app architecture, but nothing about votes.
This NPM Module adds voting to Mongoose schemas, but doesn't quite fit my needs.
This post looks very promising, as the author is sort of doing what I'm describing in point 1 above (see Listing 13 on the author's post), but he still creates a nested loop, starting in line 22 of Listing 13, to loop through each choice/option, then through each vote for each choice/option.
As a quick hint - to prevent leaking of IP addresses from DB - I would suggest to add extra collection which will store all vote sensitive data, but still have other vote data in same document.
This gives small overhead when storing data, but by design IP info will be not provided to caller and there is no need for extra data scrubbing on every call, to secure data.
I am creating an application with nodejs and mongod(Not mongoose). I have a problem that gave me headache over few days, anyone please suggest a way for this!!.
I have a mongodb design like this
post{
_id:ObjectId(...),
picture: 'some_url',
comments:[
{_id:ObjectId(...),
user_id:Object('123456'),
body:"some content"
},
{_id:ObjectId(...),
user_id:Object('...'),
body:"other content"
}
]
}
user{
_id:ObjectId('123456'),
name: 'some name', --> changable at any times
username: 'some_name', --> changable at any times
picture: 'url_link' --> changable at any times
}
I want to query the post along with all the user information so the query will look like this:
[{
_id:ObjectId(...),
picture: 'some_url',
comments:[
{_id:ObjectId(...),
user_id:Object('123456'),
user_data:{
_id:ObjectId('123456'),
name: 'some name',
username: 'some_name',
picture: 'url_link'
}
body:"some content"
},
{_id:ObjectId(...),
user_id:Object('...'),
body:"other content"
}
]
}]
I tried to use loop to manually get the user data and add to comment but it proves to be difficult and not achievable by my coding skill :(
Please anybody got any suggestion, I would be really appreciated.
P/s I am trying another approach that I would embedded all the user data in to the comment and whenever the user update their username, name or picture. They will update it in all the comment as well
The problem(s)
As written before, there are several problems when over-embedding:
Problem 1: BSON size limit
As of the time of this writing, BSON documents are limited to 16MB. If that limit is reached, MongoDB would throw an exception and you simply could not add more comments and in worst case scenarios not even change the (user-)name or the picture if the change would increase the size of the document.
Problem 2: Query limitations and performance
It is not easily possible to query or sort the comments array under certain conditions. Some things would require a rather costly aggregation, others rather complicated statements.
While one could argue that once the queries are in place, this isn't much of a problem, I beg to differ. First, the more complicated a query is, the harder it is to optimize, both for the developer and subsequently MongoDBs query optimizer. I have had the best results with simplyfying data models and queries, speeding up responses by a factor of 100 in one instance.
When scaling, the ressources needed for complicated and/or costly queries might even sum up to whole machines when compared to a simpler data model and according queries.
Problem 3: Maintainability
Last but not least you might well run into problems maintaining your code. As a simple rule of thumb
The more complicated your code becomes, the harder it is to maintain. The harder code is to maintain, the more time it needs to maintain the code. The more time it needs to maintain code, the more expensive it gets.
Conclusion: Complicated code is expensive.
In this context, "expensive" both refers to money (for professional projects) and time (for hobby projects).
(My!) Solution
It is pretty easy: simplify your data model. Consequently, your queries will become less complicated and (hopefully) faster.
Step 1: Identify your use cases
That's going to be a wild guess for me, but the important thing here is to show you the general method. I'd define your use cases as follows:
For a given post, users should be able to comment
For a given post, show the author and the comments, along with the commenters and authors username and their picture
For a given user, it should be easily possible to change the name, username and picture
Step 2: Model your data accordingly
Users
First of all, we have a straightforward user model
{
_id: new ObjectId(),
name: "Joe Average",
username: "HotGrrrl96",
picture: "some_link"
}
Nothing new here, added just for completeness.
Posts
{
_id: new ObjectId()
title: "A post",
content: " Interesting stuff",
picture: "some_link",
created: new ISODate(),
author: {
username: "HotGrrrl96",
picture: "some_link"
}
}
And that's about it for a post. There are two things to note here: first, we store the author data we immediately need when displaying a post, since this saves us a query for a very common, if not ubiquitous use case. Why don't we save the comments and commenters data acordingly? Because of the 16 MB size limit, we are trying to prevent the storage of references in a single document. Rather, we store the references in comment documents:
Comments
{
_id: new ObjectId(),
post: someObjectId,
created: new ISODate(),
commenter: {
username: "FooBar",
picture: "some_link"
},
comment: "Awesome!"
}
The same as with posts, we have all the necessary data for displaying a post.
The queries
What we have achieved now is that we circumvented the BSON size limit and we don't need to refer to the user data in order to be able to display posts and comments, which should save us a lot of queries. But let's come back to the use cases and some more queries
Adding a comment
That's totally straightforward now.
Getting all or some comments for a given post
For all comments
db.comments.find({post:objectIdOfPost})
For the 3 lastest comments
db.comments.find({post:objectIdOfPost}).sort({created:-1}).limit(3)
So for displaying a post and all (or some) of its comments including the usernames and pictures we are at two queries. More than you needed before, but we circumvented the size limit and basically you can have an indefinite number of comments for every post. But let's get to something real
Getting the latest 5 posts and their latest 3 comments
This is a two step process. However, with proper indexing (will come back to that later) this still should be fast (and hence resource saving):
var posts = db.posts.find().sort({created:-1}).limit(5)
posts.forEach(
function(post) {
doSomethingWith(post);
var comments = db.comments.find({"post":post._id}).sort("created":-1).limit(3);
doSomethingElseWith(comments);
}
)
Get all posts of a given user sorted from newest to oldest and their comments
var posts = db.posts.find({"author.username": "HotGrrrl96"},{_id:1}).sort({"created":-1});
var postIds = [];
posts.forEach(
function(post){
postIds.push(post._id);
}
)
var comments = db.comments.find({post: {$in: postIds}}).sort({post:1, created:-1});
Note that we have only two queries here. Although you need to "manually" make the connection between posts and their respective comments, that should be pretty straightforward.
Change a username
This presumably is a rare use case executed. However, it isn't very complicated with said data model
First, we change the user document
db.users.update(
{ username: "HotGrrrl96"},
{
$set: { username: "Joe Cool"},
$push: {oldUsernames: "HotGrrrl96" }
},
{
writeConcern: {w: "majority"}
}
);
We push the old username to an according array. This is a security measure in case something goes wrong with the following operations. Furthermore, we set the write concern to a rather high level in order to make sure the data is durable.
db.posts.update(
{ "author.username": "HotGrrrl96"},
{ $set:{ "author.username": "Joe Cool"} },
{
multi:true,
writeConcern: {w:"majority"}
}
)
Nothing special here. The update statement for the comments looks pretty much the same. While those queries take some time, they are rarely executed.
The indices
As a rule of thumb, one can say that MongoDB can only use one index per query. While this is not entirely true since there are index intersections, it is easy to deal with. Another thing is that individual fields in a compound index can be used independently. So an easy approach to index optimization is to find the query with the most fields used in operations which make use of indices and create a compound index of them. Note that the order of occurrence in the query matters. So, let's go ahead.
Posts
db.posts.createIndex({"author.username":1,"created":-1})
Comments
db.comments.createIndex({"post":1, "created":-1})
Conclusion
A fully embedded document per post admittedly is the the fastest way of loading it and it's comments. However, it does not scale well and due to the nature of possibly complex queries necessary to deal with it, this performance advantage may be leveraged or even eliminated.
With the above solution, you trade some speed (if!) against basically unlimited scalability and a much more straightforward way of dealing with the data.
Hth.
You are following Normalized data model approach. if you are following this model means, you have to write another query to get the user info or If you uses the embedded document store then all the user doc must change whenever updates on user doc.
http://docs.mongodb.org/v3.0/reference/database-references/ read this link for more information.
I want to design a question structure with some comments. Which relationship should I use for comments: embed or reference?
A question with some comments, like stackoverflow, would have a structure like this:
Question
title = 'aaa'
content = 'bbb'
comments = ???
At first, I thought of using embedded comments (I think embed is recommended in MongoDB), like this:
Question
title = 'aaa'
content = 'bbb'
comments = [ { content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'} ]
It is clear, but I'm worried about this case: If I want to edit a specified comment, how do I get its content and its question? There is no _id to let me find one, nor question_ref to let me find its question. (Is there perhaps a way to do this without _id and question_ref?)
Do I have to use ref rather than embed? Do I then have to create a new collection for comments?
This is more an art than a science. The Mongo Documentation on Schemas is a good reference, but here are some things to consider:
Put as much in as possible
The joy of a Document database is that it eliminates lots of Joins. Your first instinct should be to place as much in a single document as you can. Because MongoDB documents have structure, and because you can efficiently query within that structure (this means that you can take the part of the document that you need, so document size shouldn't worry you much) there is no immediate need to normalize data like you would in SQL. In particular any data that is not useful apart from its parent document should be part of the same document.
Separate data that can be referred to from multiple places into its own collection.
This is not so much a "storage space" issue as it is a "data consistency" issue. If many records will refer to the same data it is more efficient and less error prone to update a single record and keep references to it in other places.
Document size considerations
MongoDB imposes a 4MB (16MB with 1.8) size limit on a single document. In a world of GB of data this sounds small, but it is also 30 thousand tweets or 250 typical Stack Overflow answers or 20 flicker photos. On the other hand, this is far more information than one might want to present at one time on a typical web page. First consider what will make your queries easier. In many cases concern about document sizes will be premature optimization.
Complex data structures:
MongoDB can store arbitrary deep nested data structures, but cannot search them efficiently. If your data forms a tree, forest or graph, you effectively need to store each node and its edges in a separate document. (Note that there are data stores specifically designed for this type of data that one should consider as well)
It has also been pointed out than it is impossible to return a subset of elements in a document. If you need to pick-and-choose a few bits of each document, it will be easier to separate them out.
Data Consistency
MongoDB makes a trade off between efficiency and consistency. The rule is changes to a single document are always atomic, while updates to multiple documents should never be assumed to be atomic. There is also no way to "lock" a record on the server (you can build this into the client's logic using for example a "lock" field). When you design your schema consider how you will keep your data consistent. Generally, the more that you keep in a document the better.
For what you are describing, I would embed the comments, and give each comment an id field with an ObjectID. The ObjectID has a time stamp embedded in it so you can use that instead of created at if you like.
In general, embed is good if you have one-to-one or one-to-many relationships between entities, and reference is good if you have many-to-many relationships.
Well, I'm a bit late but still would like to share my way of schema creation.
I have schemas for everything that can be described by a word, like you would do it in the classical OOP.
E.G.
Comment
Account
User
Blogpost
...
Every schema can be saved as a Document or Subdocument, so I declare this for each schema.
Document:
Can be used as a reference. (E.g. the user made a comment -> comment has a "made by" reference to user)
Is a "Root" in you application. (E.g. the blogpost -> there is a page about the blogpost)
Subdocument:
Can only be used once / is never a reference. (E.g. Comment is saved in the blogpost)
Is never a "Root" in you application. (The comment just shows up in the blogpost page but the page is still about the blogpost)
I came across this small presentation while researching this question on my own. I was surprised at how well it was laid out, both the info and the presentation of it.
http://openmymind.net/Multiple-Collections-Versus-Embedded-Documents
It summarized:
As a general rule, if you have a lot of [child documents] or if they are large, a separate collection might be best.
Smaller and/or fewer documents tend to be a natural fit for embedding.
Actually, I'm quite curious why nobody spoke about the UML specifications. A rule of thumb is that if you have an aggregation, then you should use references. But if it is a composition, then the coupling is stronger, and you should use embedded documents.
And you will quickly understand why it is logical. If an object can exist independently of the parent, then you will want to access it even if the parent doesn't exist. As you just can't embed it in a non-existing parent, you have to make it live in it's own data structure. And if a parent exist, just link them together by adding a ref of the object in the parent.
Don't really know what is the difference between the two relationships ?
Here is a link explaining them:
Aggregation vs Composition in UML
If I want to edit a specified comment, how to get its content and its question?
You can query by sub-document: db.question.find({'comments.content' : 'xxx'}).
This will return the whole Question document. To edit the specified comment, you then have to find the comment on the client, make the edit and save that back to the DB.
In general, if your document contains an array of objects, you'll find that those sub-objects will need to be modified client side.
Yes, we can use the reference in the document. To populate another document just like SQL i joins. In MongoDB, they don't have joins to map one to many relationship documents. Instead that we can use populate to fulfil our scenario.
var mongoose = require('mongoose')
, Schema = mongoose.Schema
var personSchema = Schema({
_id : Number,
name : String,
age : Number,
stories : [{ type: Schema.Types.ObjectId, ref: 'Story' }]
});
var storySchema = Schema({
_creator : { type: Number, ref: 'Person' },
title : String,
fans : [{ type: Number, ref: 'Person' }]
});
The population is the process of automatically replacing the specified paths in the document with the document(s) from other collection(s). We may populate a single document, multiple documents, plain objects, multiple plain objects, or all objects returned from a query. Let's look at some examples.
Better you can get more information please visit: http://mongoosejs.com/docs/populate.html
I know this is quite old but if you are looking for the answer to the OP's question on how to return only specified comment, you can use the $ (query) operator like this:
db.question.update({'comments.content': 'xxx'}, {'comments.$': true})
MongoDB gives freedom to be schema-less and this feature can result in pain in the long term if not thought or planned well,
There are 2 options either Embed or Reference. I will not go through definitions as the above answers have well defined them.
When embedding you should answer one question is your embedded document going to grow, if yes then how much (remember there is a limit of 16 MB per document) So if you have something like a comment on a post, what is the limit of comment count, if that post goes viral and people start adding comments. In such cases, reference could be a better option (but even reference can grow and reach 16 MB limit).
So how to balance it, the answer is a combination of different patterns, check these links, and create your own mix and match based on your use case.
https://www.mongodb.com/blog/post/building-with-patterns-a-summary
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
If I want to edit a specified comment, how do I get its content and
its question?
If you had kept track of the number of comments and the index of the comment you wanted to alter, you could use the dot operator (SO example).
You could do f.ex.
db.questions.update(
{
"title": "aaa"
},
{
"comments.0.contents": "new text"
}
)
(as another way to edit the comments inside the question)
I'm considering using CouchDB for an upcoming site, but I'm a little confused as far as far as how to implement a system of user ratings for the site. Basically, each item of content can be rated by a given user. What way of doing this makes the most sense in the CouchDB model? I would think the DRYest and most logical way would be to have 3 different document types, Content, Users, and a user_rating doc that looks something like this.
{
user_id: "USERID"
content_id: "CONTENTID"
rating: 6
}
Then, I'd create a view where the map was the set of all content docs and user_rating docs keyed by content doc ids, and where the reduce tallied the mean of the ratings and returned the content doc keyed by content doc id.
Is that the best way of doing this? I haven't yet found much in the way of resources on CouchDB best practices so I'm pretty unsure of all this stuff.
My Conclusion:
The accepted answer below, which is what I pretty much was going to implement does work, but beware, the docs need to be keyed by content doc id which makes advanced queries based on other document properties troublesome. I'm going back to SQL for my needs in this app.
Sounds like you've got a reasonable idea going. CouchDB is so new that I think it'll take awhile for best practices to shake out.
A map/reduce pair like this might form a reasonable starting point.
map:
function(doc) {
if(doc.type='rating' && doc.content_id) {
emit(doc.content_id, doc.rating);
}
}
reduce:
function(keys, values) {
return sum(values)/values.length
}
NB: That map function requires adding the proper type to your Rating model:
{
type: 'rating',
user_id: "USERID",
content_id: "CONTENTID",
rating: 6
}
Well Damien Katz, one of the Couchdb developers, gives a description of a similar process, so you might be doing it the way that the Couchdb folks intend.
I wrote about a similar situation (although simpler than your example). I was adding article ratings to my blog and decided to use CouchDB to store the ratings themselves. I think you've got the right idea.
Here's a thought, though. Do you care who rated what, like for display somewhere or tracking? If so, carry on :)
If not then why not just update the content document's rating attribute to += 1 (and perhaps the user document's rated attribute to .push( doc._id ) if you want to prevent a user from rating content more than once).
This would greatly simplify your document handling and give better performance when 'reading' ratings to display on pages (since you'll already have the content document assumingly)... This would be at the cost of making the actual process of rating more expensive (bigger documents going to the server, etc).
Seems to me that sometimes CouchDB (and other key-value databases) are at their best when things aren't totally normalized.