Mongo/Mongoose: cleanup orphaned refs - node.js

Suppose we have a typical one-to-many relationship modeled using references, as suggested by the official MongoDB documentation:
var User = mongoose.Schema({
});
var Group = mongoose.Schema({
  user: [{
    type: mongoose.Schema.Types.ObjectId,
    ref: 'User'
  }]
});
Let's also assume I care about the order in which users appear in the group, so the array is necessary.
Now, let's assume that a user has been deleted and, for some reason, the groups have not been maintained with $pull. If you use Mongoose's populate, everything looks fine, but the garbage persists in the array.
Is there a way to identify the orphaned refs and remove them? Maybe even automatically -- similarly to what CASCADE does in relational world? What's the best approach to maintain the referential integrity in Mongo/Mongoose? Finally, what's the most efficient one?

First, use a remove hook on your User model to try to maintain data integrity on an ongoing basis: User.post('remove', pullUserFromGroups); Hopefully that will keep integrity mostly intact. You can remove the user from every group with a single $pull operation. This is a mongo analog to a CASCADE from relational DBs.
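A minimal sketch of what the pullUserFromGroups hook named above might look like. The factory name makePullUserFromGroups and the injected Group model are assumptions for illustration; updateMany and $pull are standard Mongoose/MongoDB calls. Which hook actually fires depends on the Mongoose version and on how the user is deleted, so check the middleware documentation for your setup.

```javascript
// Hypothetical factory: takes the Group model and returns the hook function.
// In a document post-hook, Mongoose passes the affected document as the
// first argument.
function makePullUserFromGroups(Group) {
  return function pullUserFromGroups(user) {
    // A single update removes the user's id from the `user` array of every
    // group that still contains it.
    return Group.updateMany(
      { user: user._id },
      { $pull: { user: user._id } }
    );
  };
}

// Assumed wiring, using the schema names from the question:
//   User.post('remove', makePullUserFromGroups(GroupModel));
```

Passing the model in, rather than requiring it inside the hook, keeps the function easy to exercise with a stub.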
For after-the-fact cleanup, you need to iterate over every Group, take every userId in group.user, query to see whether the record exists, and pull it out if not. It's simplest to do this one ID at a time, but you could also use User.find({_id: {$in: group.user}}), calculate which user IDs were not found, and pull them all at once.
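The batched variant described above can be sketched as follows. The function names are hypothetical, and Group and User are assumed to be Mongoose models for the schemas in the question; find, updateOne, $in and $pull are the only library features used.

```javascript
// Return the ids in `refs` that have no match in `existingIds`.
// Ids are compared as strings because two ObjectId instances holding the
// same value are still different objects.
function findOrphanedIds(refs, existingIds) {
  const existing = new Set(existingIds.map(String));
  return refs.filter((id) => !existing.has(String(id)));
}

// Hypothetical cleanup of a single group document.
async function removeOrphansFromGroup(Group, User, group) {
  // One query fetches every user that still exists for this group.
  const found = await User.find({ _id: { $in: group.user } }, '_id');
  const orphans = findOrphanedIds(group.user, found.map((u) => u._id));
  if (orphans.length > 0) {
    // One update pulls all dangling ids; the order of the rest is kept.
    await Group.updateOne(
      { _id: group._id },
      { $pull: { user: { $in: orphans } } }
    );
  }
  return orphans;
}
```

This costs two round trips per group instead of one query per referenced user.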

Related

Limit N documents per user for a specific Schema

I think it would be easier for me to explain starting with an example.
Let's say we have this Schema:
var entrySchema = new Schema({
  _id: mongoose.Schema.Types.ObjectId,
  user: {
    type: String,
    ref: User
  },
  // more fields here
});
So basically there is a 1:M kind of relationship where a user can have multiple entries stored in the DB.
Now for optimization's sake and to reduce costs I would like to only allow N (50-100) entries to be stored for a specific user.
Of course, the trivial solution would be to check each time I add an entry for a user if the limit has been reached and delete the oldest entry.
I was wondering if there is any built-in mechanism to make this easier to implement. I'm not saying it's hard to implement, just that it looks like something that maybe could be solved with a feature of mongo/mongoose that I'm not aware of.
I'm having a hard time finding anything in either the MongoDB or the Mongoose documentation.
Note: an npm package is welcome too.
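For reference, the "trivial solution" described in the question can be sketched in memory like this. The function name and the numeric createdAt field are assumptions; with Mongoose, the entries returned in removed would be deleted with a query rather than dropped from an array.

```javascript
// Add `entry` to one user's entries and split off the oldest ones that
// exceed `limit`. Each entry is assumed to carry a numeric `createdAt`.
function addWithLimit(entries, entry, limit) {
  const sorted = [...entries, entry].sort((a, b) => a.createdAt - b.createdAt);
  const overflow = Math.max(0, sorted.length - limit);
  return {
    kept: sorted.slice(overflow),       // the newest `limit` entries
    removed: sorted.slice(0, overflow), // the oldest entries to delete
  };
}
```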

In CouchDB, should I use the _id for relation and _changes?

I've been reading a lot of best practices about how I should embrace the _id. To be honest, I'm getting kind of paranoid about the repercussions I might face when I start scaling up my application if I don't do this.
Currently I have about 50k documents per database, after only a few months of heavy usage. I expect this to grow A LOT. I do a lot of .find() Mango queries, without much indexing, and to be honest I'm working off a relational-style document structure.
For example:
First get Project from ID.
Then do a find query that:
grabs all type:signature where project_id: X.
grabs all type:revisions where project_id: X.
The reason for this is I try VERY hard not to update documents. A lot of these documents are created offline, so doing a write once workflow is very important for me to avoid conflicts.
I'm currently at a point of no return, as scheduling is getting pretty intense. If I want to change the way I'm doing things, now is the best time, before it gets too crazy.
I'd love to hear your thoughts about using the _id for data structuring and what people think.
Being able to make one call with a _all_docs grab like this sounds appealing to me:
{
  "include_docs": true,
  "startkey": "project:{ID}",
  "endkey": "project:{ID}:\ufff0"
}
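As a small sketch, the parameters above could be generated per project like this. The function name projectRange is hypothetical; the key layout is the one proposed in the question.

```javascript
// Build the _all_docs range parameters for every document whose _id starts
// with "project:{projectId}". "\ufff0" is a very high code point, so the
// end key sorts after any id that extends the prefix with ":...".
function projectRange(projectId) {
  const prefix = `project:${projectId}`;
  return {
    include_docs: true,
    startkey: prefix,
    endkey: `${prefix}:\ufff0`,
  };
}
```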
An example of how ONE type of my documents is set up:
Main Document
{
  _id: {COUCH_GENERATED_1},
  type: "project",
  ..
  .
}
Signature Document
{
  _id: {COUCH_GENERATED_2},
  type: "signature",
  project_id: {COUCH_GENERATED_1},
  created_at: {UNIX_TIMESTAMP}
}
Change to Main Document
{
  _id: {COUCH_GENERATED_3},
  type: "revision",
  project_id: {COUCH_GENERATED_1},
  created_at: {UNIX_TIMESTAMP},
  data: [{..}]
}
I was wondering whether I should do something like this:
Main Document: _id: project:{kuuid_1}
Signature Document: _id: project:{kuuid_1}:signature:{kuuid_2}
Change to Main Document: _id: project:{kuuid_1}:rev:{kuuid_3}
I'm just trying to set up my database in a way that isn't going to mess with me in the future. I know problems are going to come up but I'd like not to heavily change the structure if I can avoid it.
Another reason I am thinking of this is that I watch for _changes in my databases, and being able to know what types are coming through without fetching each document every time one changes sounds appealing too.
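A sketch of how the type could be read straight from the id field of a _changes row under the proposed naming scheme. The function name parseDocId is hypothetical, and the sketch assumes the individual id parts never contain a ':' themselves.

```javascript
// Parse ids of the form "project:{p}" or "project:{p}:{type}:{k}".
// Returns null for ids that do not follow the scheme.
function parseDocId(id) {
  const parts = id.split(':');
  if (parts[0] !== 'project' || (parts.length !== 2 && parts.length !== 4)) {
    return null;
  }
  const isProject = parts.length === 2;
  return {
    projectId: parts[1],
    type: isProject ? 'project' : parts[2],
    docId: isProject ? parts[1] : parts[3],
  };
}
```

With this in place, a change listener can branch on parseDocId(change.id).type without requesting include_docs.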
Setting up your database structure so that it makes data retrieval easier is good practice. It seems to me you have some options:
If there is a field called project_id in the documents of interest, you can create an index on project_id, which would allow you to fetch all documents pertaining to a known project_id cheaply. See CouchDB Find.
Create a MapReduce index keyed on project_id, e.g. if (doc.project_id) { emit(doc.project_id); }. The index this produces would allow you to fetch documents by a known project_id with judicious use of start_key and end_key when querying the view. See Introduction to views.
As you say, packing more information into the _id field allows you to perform range queries on the _all_docs endpoint.
If you choose a key design of:
project{project_id}:signature{kuuid}
then the primary index of the database has all of a single project's documents grouped together. Putting the project_id before the ':' character is preparation for a forthcoming CouchDB feature called "partitioned databases", which groups logically related documents in their own partition, making it quicker and easier to perform queries on a single partition (in your case, a project). This feature isn't ready yet, but it's likely to have a {partition_key}:{document_key} format for the _id field, so there's no harm in getting your document _ids ready for when it lands (see the CouchDB mailing list). In the meantime, a range query on _all_docs will work.
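The MapReduce option mentioned earlier in this answer can be sketched as below. Inside CouchDB, emit is provided by the view server; the stand-in emit and the sample documents here are assumptions so that the map function can be run under plain Node.js.

```javascript
// Stand-in for CouchDB's built-in `emit`; it only records what is emitted.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// The map function as it would appear in a design document's view:
// every document that has a project_id is indexed under that id.
function map(doc) {
  if (doc.project_id) {
    emit(doc.project_id, null);
  }
}

// Hypothetical sample documents shaped like those in the question.
[
  { _id: 'a', type: 'project' },
  { _id: 'b', type: 'signature', project_id: 'a' },
  { _id: 'c', type: 'revision', project_id: 'a' },
].forEach((doc) => map(doc));
```

The project document itself is not indexed because it has no project_id field; only the signature and the revision are.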

Which mongoose model would be more efficient?

I am new to NoSQL. I am trying to build a simple e-commerce app in Node.js. For the product, I need to build CRUD operations so that only the owner can edit it, while everyone else has read-only access. The main question is: which would be the better implementation?
The code I currently have looks like this:
var mongoose = require('mongoose');
module.exports = mongoose.model('product', new mongoose.Schema({
  owner: {type: String},
  title: {type: String},
  ...
}));
The owner is actually the _id from my user model; basically, it is something like a foreign key. Is this a valid approach, or should I add an array inside the user model to store the list of objects that the user owns?
I would also like to confirm whether storing the user ID as a String, as I did for owner, is best practice, or whether I should reference the user model some other way.
Thanks in advance for helping me out.
The whole point of document databases is that you shouldn't have foreign relationships; all the data your document needs should be denormalized into the document.
So inside your product document, you should be duplicating all the owner details you need. You can store their _id as well for lookup, but don't use a string for this, use an actual ObjectId().
For more about denormalization, see The Little MongoDB Book:
Yet another alternative to using joins is to denormalize your data. Historically, denormalization was reserved for performance-sensitive code, or when data should be snapshotted (like in an audit log). However, with the ever-growing popularity of NoSQL, many of which don’t have joins, denormalization as part of normal modeling is becoming increasingly common. This doesn’t mean you should duplicate every piece of information in every document. However, rather than letting fear of duplicate data drive your design decisions, consider modeling your data based on what information belongs to what document.
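A plain-JavaScript sketch of the denormalized shape this answer describes, together with the owner-only edit check from the question. The function names and the name field are assumptions; in a real schema the copied _id would be stored as an ObjectId rather than a string.

```javascript
// Copy only the owner details the product needs, plus the owner's _id for
// lookups. `name` is an assumed field on the user document.
function buildProduct(owner, fields) {
  return { ...fields, owner: { _id: owner._id, name: owner.name } };
}

// Only the owner may edit. Ids are compared as strings so the check also
// works when one side is an ObjectId and the other a string.
function canEdit(product, userId) {
  return String(product.owner._id) === String(userId);
}
```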

The "right way" to architecture voting with Mongoose?

I'm creating a web app using Mongoose/MongoDB to store information that will be voted on. I'll be storing usernames and IP addresses with the vote (so voters can update/modify their votes if desired).
Root Question: What's the best way to securely architect voting in a Mongoose schema?
Currently, my schema looks like this (simplified):
var Thing = new Schema({
  title: {
    type: String
  },
  creator: {
    type: String
  },
  options: [{
    description: {
      type: String
    },
    votes: [{
      username: {
        type: String
      },
      ip: {
        type: String
      }
    }]
  }]
});
mongoose.model('Thing', Thing);
While this makes querying the db for any given Thing super easy, it is problematic for security for obvious reasons: I don't want to be returning usernames and IP addresses to the browser.
The problem is, I'm not sure which is the best/least painful scenario for securely returning Thing data to the browser:
Loop through each option in Thing.options, then sub-loop through each vote in Thing.options[i].votes to find the vote cast by the user requesting the data, then delete all votes to get rid of other user data. This seems to be very resource intensive, but I couldn't find a way to use indexOf in subarrays (guidance welcome on this one), i.e. Thing.options.votes.indexOf(username) or something to that effect.
Store vote information in the already-existing User schema, then search through all users for vote data and stick it all together every time I want to query a single Thing. This also seems inefficient, more resource intensive, and more complicated than necessary.
Create a separate Vote schema that stores the data more conveniently, but then adds another database call (one for the Thing, one for the Vote).
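For the first option, Array.prototype.some can stand in for the indexOf-on-subarrays lookup asked about. This is only a sketch: the function name and the output fields voteCount and votedByMe are assumptions, and the input is a plain object shaped like the Thing schema above.

```javascript
// Build a browser-safe view of a Thing: vote counts plus whether the
// requesting user voted, with no usernames or IP addresses of voters.
function toPublicThing(thing, username) {
  return {
    title: thing.title,
    creator: thing.creator,
    options: thing.options.map((option) => ({
      description: option.description,
      voteCount: option.votes.length,
      votedByMe: option.votes.some((vote) => vote.username === username),
    })),
  };
}
```

This still visits every vote once, but it avoids mutating the stored document and keeps the scrubbing in one place.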
This problem is somewhat compounded by the fact that there are different ways to vote, with this being the simplest.
Research...for posterity's sake:
This question addresses voting in databases, but for a relational db, not MongoDB/Mongoose.
This question addresses Mongoose/Node.js app architecture, but nothing about votes.
This NPM Module adds voting to Mongoose schemas, but doesn't quite fit my needs.
This post looks very promising, as the author is sort of doing what I'm describing in point 1 above (see Listing 13 on the author's post), but he still creates a nested loop, starting in line 22 of Listing 13, to loop through each choice/option, then through each vote for each choice/option.
As a quick hint: to prevent IP addresses from leaking out of the DB, I would suggest adding an extra collection that stores all sensitive vote data, while keeping the other vote data in the same document.
This adds a small overhead when storing data, but by design the IP info is never provided to the caller, so there is no need for extra data scrubbing on every call to secure the data.
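One way to picture that split, as a sketch only: each vote is divided into a part that stays inside the Thing document and a part destined for a separate collection. The function name, the voteId link field and the idea of passing the id in from outside are all assumptions.

```javascript
// Split one vote into a public part (embedded in the Thing) and a
// sensitive part (stored in a separate collection), linked by `voteId`.
function splitVote(voteId, username, ip) {
  return {
    publicVote: { voteId },
    sensitiveVote: { voteId, username, ip },
  };
}
```

Reading a Thing then never touches the sensitive collection unless the server explicitly needs to check who cast a vote.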

Mongoose - "object" in "object" [duplicate]

I want to design a question structure with some comments. Which relationship should I use for comments: embed or reference?
A question with some comments, like stackoverflow, would have a structure like this:
Question
title = 'aaa'
content = 'bbb'
comments = ???
At first, I thought of using embedded comments (I think embed is recommended in MongoDB), like this:
Question
title = 'aaa'
content = 'bbb'
comments = [ { content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'} ]
It is clear, but I'm worried about this case: If I want to edit a specified comment, how do I get its content and its question? There is no _id to let me find one, nor question_ref to let me find its question. (Is there perhaps a way to do this without _id and question_ref?)
Do I have to use ref rather than embed? Do I then have to create a new collection for comments?
This is more an art than a science. The Mongo Documentation on Schemas is a good reference, but here are some things to consider:
Put as much in as possible
The joy of a Document database is that it eliminates lots of Joins. Your first instinct should be to place as much in a single document as you can. Because MongoDB documents have structure, and because you can efficiently query within that structure (this means that you can take the part of the document that you need, so document size shouldn't worry you much) there is no immediate need to normalize data like you would in SQL. In particular any data that is not useful apart from its parent document should be part of the same document.
Separate data that can be referred to from multiple places into its own collection.
This is not so much a "storage space" issue as it is a "data consistency" issue. If many records will refer to the same data it is more efficient and less error prone to update a single record and keep references to it in other places.
Document size considerations
MongoDB imposes a 4MB (16MB with 1.8) size limit on a single document. In a world of GB of data this sounds small, but it is also 30 thousand tweets or 250 typical Stack Overflow answers or 20 Flickr photos. On the other hand, this is far more information than one might want to present at one time on a typical web page. First consider what will make your queries easier. In many cases concern about document sizes will be premature optimization.
Complex data structures:
MongoDB can store arbitrarily deep nested data structures, but cannot search them efficiently. If your data forms a tree, forest or graph, you effectively need to store each node and its edges in a separate document. (Note that there are data stores specifically designed for this type of data that one should consider as well.)
It has also been pointed out that it is impossible to return a subset of elements in a document. If you need to pick and choose a few bits of each document, it will be easier to separate them out.
Data Consistency
MongoDB makes a trade off between efficiency and consistency. The rule is changes to a single document are always atomic, while updates to multiple documents should never be assumed to be atomic. There is also no way to "lock" a record on the server (you can build this into the client's logic using for example a "lock" field). When you design your schema consider how you will keep your data consistent. Generally, the more that you keep in a document the better.
For what you are describing, I would embed the comments, and give each comment an id field with an ObjectID. The ObjectID has a time stamp embedded in it so you can use that instead of created at if you like.
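To illustrate the embedded timestamp: the first four bytes of an ObjectId encode its creation time in seconds since the Unix epoch, so it can be read from the hex string directly. The helper name objectIdToDate is hypothetical; the example id is an arbitrary 24-character sample.

```javascript
// Extract the creation time from a 24-character hex ObjectId string.
// The leading 8 hex characters are a big-endian count of seconds.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new TypeError('expected a 24-character hex ObjectId string');
  }
  return new Date(parseInt(hex.slice(0, 8), 16) * 1000);
}

// objectIdToDate('507f1f77bcf86cd799439011').toISOString()
// → '2012-10-17T21:13:27.000Z'
```

Note that the resolution is one second, so comments created within the same second cannot be ordered by this value alone.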
In general, embed is good if you have one-to-one or one-to-many relationships between entities, and reference is good if you have many-to-many relationships.
Well, I'm a bit late but still would like to share my way of schema creation.
I have a schema for everything that can be described by a word, as you would in classical OOP.
E.g.
Comment
Account
User
Blogpost
...
Every schema can be saved as a Document or Subdocument, so I declare this for each schema.
Document:
Can be used as a reference. (E.g. the user made a comment -> comment has a "made by" reference to user)
Is a "Root" in your application. (E.g. the blogpost -> there is a page about the blogpost)
Subdocument:
Can only be used once / is never a reference. (E.g. Comment is saved in the blogpost)
Is never a "Root" in your application. (The comment just shows up on the blogpost page, but the page is still about the blogpost)
I came across this small presentation while researching this question on my own. I was surprised at how well it was laid out, both the info and the presentation of it.
http://openmymind.net/Multiple-Collections-Versus-Embedded-Documents
It summarized:
As a general rule, if you have a lot of [child documents] or if they are large, a separate collection might be best.
Smaller and/or fewer documents tend to be a natural fit for embedding.
Actually, I'm quite curious why nobody spoke about the UML specifications. A rule of thumb is that if you have an aggregation, then you should use references. But if it is a composition, then the coupling is stronger, and you should use embedded documents.
And you will quickly understand why this is logical. If an object can exist independently of the parent, then you will want to access it even if the parent doesn't exist. As you can't embed it in a non-existent parent, you have to make it live in its own data structure. And if a parent exists, just link them together by adding a ref to the object in the parent.
Not sure what the difference between the two relationships is?
Here is a link explaining them:
Aggregation vs Composition in UML
If I want to edit a specified comment, how to get its content and its question?
You can query by sub-document: db.question.find({'comments.content' : 'xxx'}).
This will return the whole Question document. To edit the specified comment, you then have to find the comment on the client, make the edit and save that back to the DB.
In general, if your document contains an array of objects, you'll find that those sub-objects will need to be modified client side.
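As an alternative to editing on the client, MongoDB's positional $ update operator can change the matched array element on the server. The sketch below only builds the filter and update documents; the function name is hypothetical, and it assumes each comment carries its own _id, as another answer here suggests.

```javascript
// Build the arguments for an update that edits exactly one embedded comment.
// `comments.$` refers to the first array element matched by the filter.
function buildCommentEdit(questionId, commentId, newContent) {
  return {
    filter: { _id: questionId, 'comments._id': commentId },
    update: { $set: { 'comments.$.content': newContent } },
  };
}

// Assumed usage in the mongo shell or a driver:
//   const edit = buildCommentEdit(qId, cId, 'new text');
//   db.question.updateOne(edit.filter, edit.update);
```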
Yes, we can use a reference in the document and populate the referenced document, much like a join in SQL. MongoDB does not have SQL-style joins for mapping one-to-many relationships between documents; instead, we can use Mongoose's populate to cover this scenario.
var mongoose = require('mongoose');
var Schema = mongoose.Schema;
// A person keeps references to the stories they wrote.
var personSchema = Schema({
  _id: Number,
  name: String,
  age: Number,
  stories: [{ type: Schema.Types.ObjectId, ref: 'Story' }]
});
// A story refers back to its creator and to its fans by Person _id.
var storySchema = Schema({
  _creator: { type: Number, ref: 'Person' },
  title: String,
  fans: [{ type: Number, ref: 'Person' }]
});
Population is the process of automatically replacing the specified paths in the document with the document(s) from other collection(s). We may populate a single document, multiple documents, plain objects, multiple plain objects, or all objects returned from a query.
For more information, see http://mongoosejs.com/docs/populate.html
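Conceptually, population swaps stored ids for the documents they point to. The following plain-JavaScript sketch imitates that on in-memory data; it is not Mongoose's implementation, and the function name and the Map of documents are assumptions.

```javascript
// Replace the ids stored at `doc[path]` with the matching documents from
// `docsById` (a Map keyed by stringified id). Ids without a match are
// dropped in this sketch.
function populatePath(doc, path, docsById) {
  const populated = doc[path]
    .map((id) => docsById.get(String(id)))
    .filter((found) => found !== undefined);
  return { ...doc, [path]: populated };
}
```

For example, a person whose stories array holds three ids, one of which no longer exists, comes back with two story objects.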
I know this is quite old, but if you are looking for the answer to the OP's question on how to return only the specified comment, you can use the positional $ projection operator in a find, like this:
db.question.find({'comments.content': 'xxx'}, {'comments.$': 1})
MongoDB gives you the freedom to be schema-less, and this feature can result in pain in the long term if it is not thought through or planned well.
There are two options: embed or reference. I will not go through the definitions, as the answers above have covered them well.
When embedding, you should answer one question: is your embedded document going to grow, and if so, by how much? (Remember there is a limit of 16 MB per document.) So if you have something like comments on a post, what is the limit on the comment count if that post goes viral and people start adding comments? In such cases, a reference could be a better option (but even an array of references can grow and reach the 16 MB limit).
So how do you balance it? The answer is a combination of different patterns. Check these links and create your own mix and match based on your use case.
https://www.mongodb.com/blog/post/building-with-patterns-a-summary
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
If I want to edit a specified comment, how do I get its content and its question?
If you had kept track of the number of comments and the index of the comment you wanted to alter, you could use dot notation (SO example).
For example:
db.questions.update(
  {
    "title": "aaa"
  },
  {
    // $set changes only this field and leaves the rest of the document alone
    "$set": { "comments.0.content": "new text" }
  }
)
(as another way to edit the comments inside the question)
