I'm building a photo/video sharing social network using MongoDB. The social network has a feed, profiles, and a follower model. I basically followed an approach similar to this article for my "social feed" design. Specifically, I used the fan-out-on-write-with-buckets approach when users post stories.
My issue is when a user "likes" a story. I'm currently also using the fan-out on write approach, which basically increments/decrements a story's "like count" in every user's feed. I think this might be a bad design, since users "like" much more frequently than they post: users can quickly saturate the server by liking and unliking a popular post.
What design pattern do you recommend here? Should I use fan-out on read? Keep using fan-out on write with background workers? If the solution is background workers, what approach do you recommend for them? I'm using Node.js.
Any help is appreciated!
Thanks,
Henri
I think the best approach is:
1. increment/decrement a counter in your database to keep track of the number of likes
2. insert each like into a collection called 'like' as a single document, tracking the id of the user who liked the story and the id of the liked story.
Then if you just need the number of likes you can read the counter, which is really fast; if instead you need to know who the likes came from, you query the 'like' collection by story id and get the ids of all users who liked the story.
The documents in the 'like' collection will look like this:
{
  _id: 'dfggsdjtsdgrhtd',
  story_id: 'ertyerdtyfret',
  user_id: 'sdrtyurertyuwert'
}
You can store the counter in the story's document itself:
{
...
likes: 56
}
You can also keep track of the last likes in the story's document itself (for example the last 1,000; only the last because MongoDB documents are limited to 16 MB, and if your application scales a lot you will run into problems storing a potentially unlimited amount of data in a single document). With this approach you can still easily query the 'like' collection to get the rest.
When someone unlikes a story you can simply remove the like document from the 'like' collection. A better approach (e.g. if you send a notification when someone's story is liked) is to just mark the document as unliked: if the same user likes the story again, you can see that the like was already inserted and avoid sending another notification.
example:
first time insert:
{
  _id: 'dfggsdjtsdgrhtd',
  story_id: 'ertyerdtyfret',
  user_id: 'sdrtyurertyuwert',
  active: true
}
When unliked, update it to:
{
  _id: 'dfggsdjtsdgrhtd',
  story_id: 'ertyerdtyfret',
  user_id: 'sdrtyurertyuwert',
  active: false
}
When a like is added, check whether a document with the same story id and user id already exists. If it does and active is false, the user has already liked and unliked the story, so when they like it again you won't resend an already-sent notification.
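A minimal Node.js sketch of the two-step design above. The 'like' collection name and the story's likes counter come from the answer; the helper names and the use of plain MongoDB-driver update documents are illustrative assumptions:

```javascript
// Builds the upsert that records a like (re-activating a previous one if it
// exists) and the $inc for the story's counter. With the MongoDB Node.js
// driver you would pass these to updateOne() on the 'like' and story
// collections respectively.

function buildLikeUpsert(storyId, userId) {
  return {
    filter: { story_id: storyId, user_id: userId },
    update: { $set: { active: true } },
    options: { upsert: true }  // inserts the doc on first like, updates it after
  };
}

function buildCounterUpdate(storyId, delta) {
  // delta is +1 on like, -1 on unlike
  return { filter: { _id: storyId }, update: { $inc: { likes: delta } } };
}

// e.g. const { filter, update, options } = buildLikeUpsert(storyId, userId);
//      await db.collection('like').updateOne(filter, update, options);
```

Because the counter lives on the story document rather than in every follower's feed, a like touches two documents instead of fanning out to thousands.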
I am trying to implement relations between collections. My requirement is:
Post request 1, json body:
{
"username":"aaa",
"password":"bbb",
"role":"owner",
"company":"SAS"
}
Post request 2, created after the first document, so I got the company name from the previous JSON body:
{
"username":"eee",
"password":"fff",
"role":"engineer",
"company":"SAS"
}
Post request 3, created after the first document, so I got the company name from the previous JSON body:
{
"username":"uuu",
"password":"kkk",
"role":"engineer",
"company":"SAS"
}
Post request 4, next company json body:
{
"username":"hhh",
"password":"ggg",
"role":"owner",
"company":"GVG"
}
Here company is a foreign key field. How can I reference the company by an id field, so that neither insert can fail independently of the other, as with transactions?
In MySQL I would create two tables, company and user, and use a transaction to insert into both tables in a single POST, linking them by id; then if the company name is ever updated, the id stays the same for both owner and engineers.
How can I achieve this in MongoDB with Node.js?
In my online searches, most suggestions are to avoid transactions and use MongoDB features such as embedded documents.
I would suggest you start by making schemas for user and company using Mongoose. It's an ODM (Object Document Mapper) that is almost always used with Node.js and MongoDB.
Now, this is a one-to-many relation. In a relational database, as you mentioned, you would make a company table and a user table.
In MongoDB it "depends". If it's a one-to-"few" relationship, you would just nest a users array in the company document. Then, since you are only updating a single document (pushing a user onto the users array in the company document), you won't need any transactions: a single-document update is always atomic, no matter how many fields you update on that document.
But if each company can have a large number of users (an ever-growing nested array is not good, as it can cause data fragmentation and bad performance), then it's better to store the company's id in the user's document. Even in this case you won't need a transaction, since you are not updating the company's document.
Another reason for storing users in a separate collection is querying: if you just want to query users, it's difficult when they are nested inside companies. So basically, consider how you will query, figure out the cardinality of the relation, and then decide whether to nest or to store in separate collections.
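A small sketch of the two shapes described above, written as plain MongoDB-driver update/insert documents (the collection layout follows the answer; field names are illustrative assumptions):

```javascript
// One-to-"few": users embedded in the company document. Adding a user is a
// single atomic $push on one document, so no transaction is needed.
function embedUserUpdate(companyName, user) {
  return {
    filter: { name: companyName },
    update: { $push: { users: user } }
  };
}

// One-to-many: each user document references the company by its _id instead.
// Inserting the user touches only one document, so again no transaction.
function referencedUserDoc(companyId, username, role) {
  return { username, role, company_id: companyId };
}
```

The first shape suits a handful of users per company; the second keeps documents small when the user list grows without bound.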
First of all, you should note that Mongo is a document-oriented DB, not a relational one. So if you need transactions and a relational model, perhaps you should try an SQL database instead, especially if you are more familiar with them.
About relations and data modeling: you should read this article (or even the entire section) in the official MongoDB docs, Data Modelling.
TL;DR: you could create two separate collections (the equivalent of tables in SQL), such as employees and companies (by default, collection names are pluralized), and store the data separately.
So your employees will be stored as you showed above, but companies will look like:
{
  _id: ObjectID("35473645632"),
  name: "SAS"
}, ...
In your employees collection, then, you should store not "company": "SAS" but "company": ObjectID("35473645632") (or even an array of ids, if you want that too). But don't forget to edit your schema accordingly.
You could use not just MongoDB's default _id but your own: it could be any unique number/string combination.
That way, if your company is renamed, its connection with the other documents (employees) will still be there.
To fetch all/any of your employees together with the company names, you should use the aggregation framework with $lookup instead of .find().
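For example, a sketch of such a $lookup pipeline. The employees/companies collection names follow the answer; the projected field names are assumptions:

```javascript
// Joins each employee to its company document and projects the company name.
const pipeline = [
  {
    $lookup: {
      from: 'companies',       // the collection to join against
      localField: 'company',   // the ObjectID stored on the employee
      foreignField: '_id',
      as: 'companyDocs'        // result lands here as an array
    }
  },
  { $unwind: '$companyDocs' }, // one matching company per employee
  { $project: { username: 1, role: 1, company: '$companyDocs.name' } }
];
// usage: db.collection('employees').aggregate(pipeline).toArray()
```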
I'm currently trying to learn Node.js and MongoDB by building the server side of a web application that should manage insurance documents for insurance agents.
So let's say I'm the user: I sign in, then I start to add my customers and their insurances.
So I have 2 collection related, Customers and Insurances.
I have one more collection to store the users login data, let's call it Users.
I don't want the new users to see and modify the customers and the insurances of other users.
How can I "divide" the user-related records, so that each user can work only with his own data?
I figured out I can add to every record the _id of the user who created it.
For example, I log in as myself and get my id "001"; I could add a field with this value to every customer and insurance.
That way I could filter every query on this field.
Would it be a good idea? In my opinion this filtering is a waste of processing power for MongoDB.
If someone has any idea of a solution, or even a link to an article about it, it would be helpful.
Thank you.
This is more a general permissions problem than just a MongoDB question. Also, without knowing more about your schemas it's hard to give specific advice.
However, here are some approaches:
1) Embed sub-documents
Since MongoDB is a document store allowing you to store arbitrary JSON-like objects, you could simply store the customers and licenses wholly inside each user object. That way querying for a user would return their customers and licenses as well.
2) Denormalise
Common practice for NoSQL databases is to denormalise related data (ie. duplicate the data). This might include embedding a sub-document that is a partial representation of your customers/licenses/whatever inside your user document. This has the similar benefit to the above solution in that it eliminates additional queries for sub-documents. It also has the same drawbacks of requiring more care to be taken for preserving data integrity.
3) Reference with foreign key
This is a more traditionally relational approach, and is basically what you're suggesting in your question. Depending on whether you want the reference to be bi-directional (both documents reference each other) or uni-directional (one document references the other), you can either store the user's ID in a foreign user_id field, or store an array of customer_ids and insurance_ids in the user document. In relational parlance this is sometimes described as "has many" or "belongs to" (the user has many customers; the customer belongs to a user).
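As a sketch of the uni-directional variant of option 3 (the user's _id stored on each owned document; the collection names and the user_id field are assumptions carried over from the question):

```javascript
// Every customer/insurance query is scoped by the owning user's _id, so a
// user only ever sees their own records.
function scopedFilter(userId, extra) {
  return Object.assign({ user_id: userId }, extra || {});
}

// usage:
//   db.collection('customers').find(scopedFilter(currentUserId))
//   db.collection('insurances').find(scopedFilter(currentUserId, { active: true }))
```

With an index on user_id, this per-user filtering is cheap; it is the standard shape for multi-tenant data, not a waste of processing power.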
I've been reading a lot about best practices and how I should embrace the _id. To be honest, I'm getting kind of paranoid about the repercussions I might face if I don't do this before I start scaling up my application.
Currently I have about 50k documents per database, and it's only been a few months of heavy usage. I expect this to grow A LOT. I do a lot of .find() Mango queries, not much indexing, and to be honest I'm working off a relational-style document structure.
For example:
First, get the project from its ID.
Then do find queries that:
grab all type:signature docs where project_id: X;
grab all type:revision docs where project_id: X.
The reason for this is that I try VERY hard not to update documents. A lot of these documents are created offline, so a write-once workflow is very important for me to avoid conflicts.
I'm currently at a point of no return as scheduling is getting pretty intense. If I want to change the way I'm doing things now is the best time before it gets too crazy.
I'd love to hear your thoughts about using the _id for data structuring and what people think.
Being able to make one call with an _all_docs grab like this sounds appealing to me:
{
"include_docs": true,
"startkey": "project:{ID}",
"endkey": "project:{ID}:\ufff0"
}
An example of how ONE type of my documents is set up is like so:
Main Document
{
  _id: {COUCH_GENERATED_1},
  type: "project",
  ...
}
Signature Document
{
_id: {COUCH_GENERATED_2},
type: "signature",
project_id: {COUCH_GENERATED_1},
created_at: {UNIX_TIMESTAMP}
}
Change to Main Document
{
_id: {COUCH_GENERATED_3},
type: "revision",
project_id: {COUCH_GENERATED_1},
created_at: {UNIX_TIMESTAMP},
data: [{..}]
}
I was wondering whether I should do something like this:
Main Document: _id: project:{kuuid_1}
Signature Document: _id: project:{kuuid_1}:signature:{kuuid_2}
Change to Main Document: _id: project:{kuuid_1}:rev:{kuuid_3}
I'm just trying to set up my database in a way that isn't going to mess with me in the future. I know problems are going to come up but I'd like not to heavily change the structure if I can avoid it.
Another reason I'm thinking of this is that I watch for _changes in my databases, and being able to know which types are coming through without fetching each document every time one changes sounds appealing too.
Setting up your database structure so that it makes data retrieval easier is good practice. It seems to me you have some options:
If there is a field called project_id in the documents of interest, you can create an index on project_id, which would allow you to fetch all documents pertaining to a known project_id cheaply (see CouchDB Find).
Create a MapReduce index keyed on project_id, e.g. if (doc.project_id) { emit(doc.project_id, null); }. The index this produces would allow you to fetch documents by a known project_id with judicious use of start_key & end_key when querying the view (see Introduction to views).
As you say, packing more information into the _id field allows you to perform range queries on the _all_docs endpoint.
If you choose a key design of:
project{project_id}:signature{kuuid}
then the primary index of the database has all of a single project's documents grouped together. Putting the project_id before the ':' character is preparation for a forthcoming CouchDB feature called "partitioned databases", which groups logically related documents in their own partition, making it quicker and easier to perform queries on a single partition (in your case, a project). This feature isn't ready yet, but it's likely to have a {partition_key}:{document_key} format for the _id field, so there's no harm in getting your document _ids ready for when it lands (see the CouchDB mailing list). In the meantime, a range query on _all_docs will work.
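For illustration, the _all_docs range query for a single project under the key scheme from the question can be built like this (a sketch following the startkey/endkey shape shown above):

```javascript
// All of a project's documents sort together in the primary index, so one
// range query fetches the project doc plus its signatures and revisions.
function projectRangeQuery(projectId) {
  return {
    include_docs: true,
    startkey: 'project:' + projectId,
    endkey: 'project:' + projectId + ':\ufff0' // \ufff0 sorts after any key suffix
  };
}
// usage: pass these as query parameters to GET /db/_all_docs
```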
I have two sets of data in the same collection in Cosmos DB, 'posts' and 'users', linked by the posts users create.
Currently my structure is as follows:
// user document
{
id: 123,
postIds: ['id1','id2']
}
// post document
{
id: 'id1',
ownerId: 123
}
{
id: 'id2',
ownerId: 123
}
My main issue with this setup is how fragile it is: code has to enforce the link, and if there's a bug, data will very easily be lost with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As David said, it's a long discussion, but it's a very common one, so since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity), which is something you need when you decompose a bigger object into its constituent pieces. This is also called normalization.
While this is normally done in a relational database, it is now becoming popular in non-relational databases too, since it helps a lot to avoid data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen a JSON document database, you should leverage the fact that it can store an entire document, and just store the post ALONG WITH all the owner data: name, surname, and whatever other data you have about the user who created it. Yes, I'm saying you may want to evaluate not having posts and users, but just posts, with the user info inside each one. This may actually be very correct, as you will be sure to get the EXACT data for the user as it existed at the moment of post creation. Say, for example, I create a post while I have biography "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and that is just right, as they have exactly captured reality.
Of course you may also want to display a biography on an author page. In that case you'll have a problem: which one will you use? Probably the last one.
If all authors, in order to exist in your system, MUST have a published blog post, that may well be enough. But maybe you want an author to be able to write their biography and be listed in your system even before they write a blog post.
In that case you need to NORMALIZE the model and create a new document type just for authors. If this is your case, then you also need to figure out how to handle the situation described before: when an author updates their biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let's assume that you really do want to keep posts and users in separate documents, and thus you normalize your model. In that case, keep in mind that Cosmos DB (and NoSQL databases in general) DOES NOT OFFER any native support for enforcing referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property so that, before deleting an author for example, you can efficiently check whether there are any blog posts by him/her that would otherwise be left orphaned.
Another option is to manually create and keep updated ANOTHER document that, for each author, tracks the blog posts he/she has written. With this approach you can just look at that document to see which blog posts belong to an author. You can try to keep it updated using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
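A sketch of such a tracking document and the application-side update that keeps it in sync (the document shape and function name are assumptions, not a Cosmos DB feature):

```javascript
// One document per author listing his/her posts; the application (or a
// trigger) must update it whenever a post is created or deleted.
function addPostId(trackingDoc, postId) {
  if (!trackingDoc.postIds.includes(postId)) { // idempotent on retries
    trackingDoc.postIds.push(postId);
  }
  return trackingDoc;
}

const authorPosts = { id: 'posts-of-123', ownerId: 123, postIds: ['id1'] };
// addPostId(authorPosts, 'id2') then persists the doc back to the container
```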
PERFORMANCE
Performance COULD be an issue, but you don't usually model for performance in the first place. You model to make sure your design can represent and store the information you need from the real world, and then you optimize it to get decent performance from the database you have chosen to use. As different databases have different constraints, the model is then adapted to deal with those constraints. This is nothing more and nothing less than the good old "logical" vs "physical" modeling discussion.
In Cosmos DB's case, you should avoid queries that go cross-partition, as they are more expensive.
Unfortunately partitioning is something you choose once and for all, so you really need to be clear about which use cases you most want to support. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will be only if you have A LOT of authors. If you have only one, for example, all data and queries will go to a single partition, limiting your performance A LOT. Remember, in fact, that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means your values will be spread across those 5 partitions, and each partition will have a top limit of 2,000 RU. If all your queries hit just one partition, your real maximum throughput is 2,000 RU, not 10,000.
I really hope this helps you start to figure out the answer, and that it helps foster and grow a discussion (how to model for a document database) that I think is really due and mature now.
I'm considering using CouchDB for an upcoming site, but I'm a little confused as to how to implement a system of user ratings for the site. Basically, each item of content can be rated by a given user. What way of doing this makes the most sense in the CouchDB model? I would think the DRYest and most logical way would be to have three different document types: Content, User, and a user_rating doc that looks something like this.
{
  user_id: "USERID",
  content_id: "CONTENTID",
  rating: 6
}
Then I'd create a view whose map was the set of all content docs and user_rating docs keyed by content doc id, and whose reduce tallied the mean of the ratings, returning it keyed by content doc id.
Is that the best way of doing this? I haven't yet found much in the way of resources on CouchDB best practices so I'm pretty unsure of all this stuff.
My Conclusion:
The accepted answer below, which is pretty much what I was going to implement, does work. But beware: the docs need to be keyed by content doc id, which makes advanced queries based on other document properties troublesome. I'm going back to SQL for my needs in this app.
Sounds like you've got a reasonable idea going. CouchDB is so new that I think it'll take a while for best practices to shake out.
A map/reduce pair like this might form a reasonable starting point.
map:
function(doc) {
  if (doc.type === 'rating' && doc.content_id) {
    emit(doc.content_id, doc.rating);
  }
}
reduce:
function(keys, values) {
  // note: a rereduce-safe average would combine [sum, count] pairs instead
  return sum(values) / values.length;
}
NB: That map function requires adding the proper type to your Rating model:
{
type: 'rating',
user_id: "USERID",
content_id: "CONTENTID",
rating: 6
}
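Assuming the view lives in a design document (the 'ratings'/'average' names here are hypothetical), you could then fetch the reduced average for one piece of content by querying the view with a key, as sketched below:

```javascript
// Builds the view URL; with reduce enabled, querying by key returns the
// mean rating for that content id. View keys are JSON, hence the
// JSON.stringify before URL-encoding.
function ratingUrl(dbUrl, contentId) {
  return dbUrl + '/_design/ratings/_view/average?key=' +
    encodeURIComponent(JSON.stringify(contentId));
}
```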
Well, Damien Katz, one of the CouchDB developers, gives a description of a similar process, so you might be doing it the way the CouchDB folks intend.
I wrote about a similar situation (although simpler than your example). I was adding article ratings to my blog and decided to use CouchDB to store the ratings themselves. I think you've got the right idea.
Here's a thought, though. Do you care who rated what, like for display somewhere or tracking? If so, carry on :)
If not, then why not just update the content document's rating attribute with += 1 (and perhaps push doc._id onto the user document's rated attribute if you want to prevent a user from rating content more than once)?
This would greatly simplify your document handling and give better performance when "reading" ratings to display on pages (since you'll presumably already have the content document anyway). The cost is making the actual process of rating more expensive (bigger documents going to the server, etc.).
Seems to me that sometimes CouchDB (and other key-value databases) are at their best when things aren't totally normalized.
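Note that CouchDB has no in-place increment: the "+= 1" update above is really a read-modify-write of the whole document. A sketch (conflict-retry details omitted; the rated_by field name is an assumption):

```javascript
// Bump the rating locally, then PUT the doc back with its current _rev;
// a 409 response means someone else updated it first and you must retry.
function withBumpedRating(doc, userId) {
  const updated = Object.assign({}, doc); // shallow copy, keeps _id and _rev
  updated.rating = (updated.rating || 0) + 1;
  if (userId) {
    updated.rated_by = (updated.rated_by || []).concat(userId);
  }
  return updated;
}
```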