How to store votes for CouchDB document? - couchdb

I am looking for a good example of how to store votes in a document.
For example, say we have a document which is a post, and users can vote on it.
If I store the vote in a field in the document, for example:
votes : 12345
What will happen if the author is editing the post and during this time someone votes? The author is not going to be able to save, because somebody voted and the document now has a new revision.
The other option is to store votes separately: either each vote as its own document, or one document holding the votes for each post.
If I decide to store every vote in a different document, how difficult is it going to be to aggregate this data? Or do I have to calculate it each time I show the document?
What are your solutions?
regards

This will result in a conflict. There's a chapter in the CouchDB Guide about handling conflicts.
http://guide.couchdb.org/draft/conflicts.html
If you use middleware (such as PHP) it can recognize and handle the conflict (see the wiki for example code: http://wiki.apache.org/couchdb/Replication_and_conflicts).
If you want to offer a pure CouchApp, it should be possible to use update handlers to manage some common conflict cases automatically, as sketched below. http://wiki.apache.org/couchdb/Document_Update_Handlers
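For illustration, a minimal sketch of such an update handler (the database URL and field names are assumptions; the handler body is CouchDB JavaScript stored as a string in the design document):

import requests

DB = "http://localhost:5984/blog"  # assumed CouchDB database URL

# The handler runs server-side against the current revision, so a voter
# never has to fetch the post and race the author on _rev.
design_doc = {
    "_id": "_design/posts",
    "updates": {
        "vote": (
            "function(doc, req) {"
            "  doc.votes = (doc.votes || 0) + 1;"
            "  return [doc, 'voted'];"
            "}"
        ),
    },
}
requests.put(f"{DB}/_design/posts", json=design_doc)

# Cast a vote on a post without fetching the document first.
requests.put(f"{DB}/_design/posts/_update/vote/some-post-id")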
If it works, I would prefer to store the votes in the document. But I have not tried any of these approaches myself yet, so I would be happy if you share your solution.

I found this article very helpful on the subject of how to avoid conflicts when many users update a document, for example by voting on or commenting on a blog post.
http://www.cmlenz.net/archives/2007/10/couchdb-joins
The third and best(?) solution was to store each comment as a separate document with a link to the blog post. Using complex keys made it very easy to query for all comments belonging to a post, as well as for all comments made by a user, even sorted in chronological order.
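The same pattern works for votes. A rough sketch against CouchDB's HTTP API (using Python requests; the database URL and the type/post_id fields are assumptions): each vote is its own tiny document, and a view with a complex key plus the built-in _count reduce aggregates them per post.

import requests

DB = "http://localhost:5984/blog"  # assumed CouchDB database URL

# Map each vote document under a [post_id, user_id] complex key and
# count them with the built-in _count reduce.
design_doc = {
    "_id": "_design/votes",
    "views": {
        "by_post": {
            "map": (
                "function(doc) {"
                "  if (doc.type === 'vote') emit([doc.post_id, doc.user_id], 1);"
                "}"
            ),
            "reduce": "_count",
        }
    },
}
requests.put(f"{DB}/_design/votes", json=design_doc)

# Vote count for one post: reduce over its key range, grouped by post_id.
resp = requests.get(
    f"{DB}/_design/votes/_view/by_post",
    params={"startkey": '["some-post-id"]',
            "endkey": '["some-post-id", {}]',
            "group_level": 1},
)
print(resp.json())  # e.g. {"rows": [{"key": ["some-post-id"], "value": 12345}]}

Since casting a vote is just an insert of a new document, it can never conflict with the author editing the post itself.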

Related

MongoDb slow aggregation with many collections (lookup)

I'm working on a MEAN stack project. I use many collections in my aggregation, so I use a lot of $lookup stages, and that negatively impacts performance and makes the aggregation very slow. I was wondering if you have any suggestions. I found that we can reduce lookups by creating, for each collection I need, an array of objects in a global collection; however, I'm looking for an optimal and safe solution.
For information, I have defined indexes on all collections in Mongo.
Thanks for sharing your ideas!
This is a very involved question. Even if you gave all your schemas and queries, it would take too long to answer, and the answer would be very specific to your case (i.e. not useful to anyone else coming along later).
Instead for a general answer, I'd advise you to read into denormalization and consider some database redesign if this query is core to your project.
Here is a good article to get you started.
Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
A simple example to outline it:
Say you have a Blog with a comment collection, and a user collection
You want to display each comment with the name of its user, so you have to load the user for every comment.
Instead you could save the username on the comment collection as well as the user collection.
Then you will have a fast query to show comments, as you don't need to load the users too. But if a user changes their name, you will have to update all of their comments with the new name. This is the main tradeoff.
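A rough pymongo sketch of this trade-off (database, collection, and field names are illustrative):

from pymongo import MongoClient

db = MongoClient()["blog"]  # assumed database name

# Write path: denormalize the username onto each comment.
def add_comment(user_id, post_id, text):
    user = db.users.find_one({"_id": user_id}, {"username": 1})
    db.comments.insert_one({
        "post_id": post_id,
        "user_id": user_id,
        "username": user["username"],  # duplicated for fast reads
        "text": text,
    })

# Read path: one query returns everything needed to display, no $lookup.
def comments_for_post(post_id):
    return list(db.comments.find({"post_id": post_id}))

# The cost: renaming a user now touches every one of their comments.
def rename_user(user_id, new_name):
    db.users.update_one({"_id": user_id}, {"$set": {"username": new_name}})
    db.comments.update_many({"user_id": user_id}, {"$set": {"username": new_name}})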
If a DB redesign is too difficult, I suggest splitting the work into multiple aggregations and combining the results in memory (i.e. in your Node server-side code).

How to structure relationships in Azure Cosmos DB?

I have two sets of data in the same collection in Cosmos: one is 'posts' and the other is 'users', linked by the posts users create.
Currently my structure is as follows;
// user document
{
  id: 123,
  postIds: ['id1', 'id2']
}
// post documents
{
  id: 'id1',
  ownerId: 123
}
{
  id: 'id2',
  ownerId: 123
}
My main issue with this setup is how fragile it is: code has to enforce the link, and if there's a bug, data will very easily be orphaned with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As David said, it's a long discussion, but it is a very common one, so, since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity), which is something you need when you decompose a bigger object into its constituent pieces. This is also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational databases, since it helps a lot to avoid data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen to use a JSON document database, you should leverage the fact that it can store the entire document, and simply store the post ALONG WITH all the owner data: name, surname, and whatever else you have about the user who created it. Yes, I'm saying that you may want to evaluate having not posts and users, but just posts, with the user info embedded. This may actually be exactly right, as you will be sure to capture the EXACT data for the user as it existed at the moment of post creation. Say, for example, I create a post while my biography is "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and this is just right, as they have exactly captured reality.
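For illustration, a fully embedded post might look like this (all field names are made up):

# A sketch of a fully embedded post: the author data is captured
# as it was at the moment of creation.
post = {
    "id": "id1",
    "title": "My first post",
    "author": {
        "id": 123,
        "name": "Jane",
        "biography": "X",  # a later biography change won't rewrite this post
    },
}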
Of course you may also want to display a biography on an author page. In this case you'll have a problem: which one will you use? Probably the latest one.
If all authors, in order to exist in your system, MUST have a blog post published, that may well be enough. But maybe you want an author to be able to write a biography and be listed in your system even before writing a blog post.
In that case you need to NORMALIZE the model and create a new document type, just for authors. If this is your case, you also need to figure out how to handle the situation described before: when an author updates their biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let's assume that you really do want posts and users kept in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (and NoSQL databases in general) DOES NOT OFFER any kind of native support for enforcing referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property so that, before deleting an author, you can efficiently check whether there are any blog posts by him/her that would otherwise be left orphaned.
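As a rough sketch of that check with the azure-cosmos Python SDK (the account endpoint, key, database/container names, and the author-id partition key are all assumptions):

from azure.cosmos import CosmosClient

# Assumed endpoint, key, and names; adjust to your account.
client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
container = client.get_database_client("blog").get_container_client("content")

def author_has_posts(author_id):
    # Counts posts referencing the author; cheap if ownerId is indexed.
    results = container.query_items(
        query="SELECT VALUE COUNT(1) FROM c WHERE c.ownerId = @oid",
        parameters=[{"name": "@oid", "value": author_id}],
        enable_cross_partition_query=True,
    )
    return next(iter(results)) > 0

def delete_author(author_id):
    # Enforcing referential integrity is the application's job here.
    if author_has_posts(author_id):
        raise ValueError("author still has posts; delete or reassign them first")
    container.delete_item(item=str(author_id), partition_key=str(author_id))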
Another option is to manually create and keep updated ANOTHER document that, for each author, tracks the blog posts he/she has written. With this approach you can just look at this document to find which blog posts belong to an author. You can try to keep this document automatically updated using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
PERFORMANCE
Performance COULD be an issue, but you don't usually model to support performance in the first place. You model to make sure your model can represent and store the information you need from the real world, and then you optimize it to get decent performance with the database you have chosen to use. As different databases have different constraints, the model is then adapted to deal with those constraints. This is nothing more and nothing less than the good old "logical" vs "physical" modeling discussion.
In Cosmos DB's case, you should avoid queries that go cross-partition, as they are more expensive.
Unfortunately partitioning is something you choose once and for all, so you really need to be clear about which use cases you most want to support. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will be only if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting your performance A LOT. Remember, in fact, that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means your data will be spread across 5 partitions, each with a top limit of 2,000 RU. If all your queries hit just one partition, your real maximum throughput is 2,000 RU, not 10,000.
I really hope this helps you start to figure out the answer. And I really hope it helps to foster and grow a discussion (how to model for a document database) that I think is really due now.

Applying "tag" to millions of documents, using bulk/update methods

We have about 55,000,000 documents in our Elasticsearch instance. We have CSV files of user_ids; the biggest CSV has 9M entries. Our documents are keyed by user_id, so this is convenient.
I am posting the question because I want to discuss and find the best option to get this done, as there are different ways to address this problem. We need to add a new "label" to a user's document if it doesn't have it yet, e.g. tagging the user with "stackoverflow" or "github".
There is the classic partial update endpoint. This sounds very slow, as we would need to iterate over 9M user_ids and issue an API call for each of them.
There is the bulk request, which provides better performance, but it is limited to roughly 1,000-5,000 documents per call, and knowing when a batch is too large is know-how we would have to learn on the go.
Then there is the official open issue for the /update_by_query endpoint, which has lots of traffic but no confirmation that it was implemented in a standard release.
On that open issue there is a mention of an update_by_query plugin, which should provide better handling, but there are old and open issues where users complain of performance problems and memory issues.
I am not sure if it's doable in ES, but I thought I would load all the CSV entries into a separate index, somehow join the two indexes, and apply a script that adds the tag if it doesn't exist yet.
So the question remains: what's the best way to do this? If any of you have done this in the past, please share your numbers/performance and what you would do differently this time.
While waiting for update by query support, I have opted for:
Use the scan/scroll API to loop over the document IDs you want to tag (related answer).
Use the bulk API to perform partial updates to set the tag on every matching doc.
Additionally, I store the tag data (your CSV) in a separate doc type, query from that, and tag all new docs as they are created, i.e. to avoid having to first index and then update.
Python snippet to illustrate the approach:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def actiongen():
    # Scroll over all matching documents, fetching only their IDs.
    docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
    for doc in docs:
        # Emit one partial-update action per matching document.
        yield {
            '_op_type': 'update',
            '_index': doc['_index'],
            '_type': doc['_type'],
            '_id': doc['_id'],
            'doc': {'tags': tags},
        }

helpers.bulk(es, actiongen(), index=args.index, stats_only=True)
Using the aforementioned update-by-query plugin, you would simply call:
curl -XPOST localhost:9200/index/type/_update_by_query -d '{
  "query": {"filtered": {"filter": {
    "not": {"term": {"tag": "github"}}
  }}},
  "script": "ctx._source.label = \"github\""
}'
The update-by-query plugin only accepts a script, not partial documents.
As for performance and memory issues, I guess the best thing is to give it a try.
I'd go with the bulk API, with the caveat that you should try to update each document the minimal number of times. Updates are just atomic deletes and adds, and they leave behind the deleted document as a tombstone until it can be merged out.
Sending a Groovy script to execute the update probably makes the most sense here, so you don't have to fetch the document first.
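For instance, a rough sketch of such a scripted action for the Python bulk helper (pre-2.x Groovy scripting syntax; the index, type, and field names are illustrative, and the exact action keys depend on your client and ES version):

user_id = '12345'  # illustrative

# One scripted update action: the tag is appended server-side, so the
# document never has to be fetched client-side first.
action = {
    '_op_type': 'update',
    '_index': 'users',
    '_type': 'user',
    '_id': user_id,
    'script': 'if (!ctx._source.tags.contains(tag)) { ctx._source.tags += tag }',
    'params': {'tag': 'github'},
}

Actions like this can be fed to helpers.bulk(es, ...) exactly as in the snippet above.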
You could create a Parent/Child relationship whereby you add a 'tags' type which references your 'posts' type as its parent. This way you wouldn't need to perform a full reindex of your data - simply index each of the appropriate tags against the appropriate post ID.
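If you go that route, a minimal sketch of the mapping and indexing calls (pre-5.x _parent syntax; the index, type, and ID values are made up):

import requests

# Declare 'tag' documents as children of 'post' documents.
mapping = {
    "mappings": {
        "post": {},
        "tag": {"_parent": {"type": "post"}},
    }
}
requests.put("http://localhost:9200/tagged", json=mapping)

# Index a tag as a child of post 42; the post itself is never reindexed.
requests.put(
    "http://localhost:9200/tagged/tag/github-42",
    params={"parent": "42"},
    json={"tag": "github"},
)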
A very old thread. I landed here through the GitHub page for "update by query" to see if it was implemented in 2.0, but unfortunately it is not. Thanks to the plugin from Teka, if the update is small, that is very much doable from Sense, but our use case was to update millions of documents daily based on certain complex queries. In the end, we moved to the es-hadoop connector. Although the infrastructure is a big overhead here, parallelizing the process of fetching/updating/inserting documents through Spark helped us anyhow. If anyone has discovered any other suggestion :) in the past year, I would love to hear about it.

Database Design for "Likes" in a social network (MongoDB)

I'm building a photo/video sharing social network using MongoDB. The social network has a feed, profiles and a follower model. I basically followed a similar approach to this article for my "social feed" design. Specifically, I used the fan-out on write with bucket approach when users posts stories.
My issue is when a user "likes" a story. I'm currently also using the fan-out on write approach that basically increments/decrements a story's "like count" for every user's feed. I think this might be a bad design since users "like" more frequently than they post. Users can quickly saturate the server by liking and unliking a popular post.
What design pattern do you guys recommend here? Should I use fan-out on read? Keep using fan-out on write with background workers? If the solution is "background workers", what approach do you recommend for them? I'm using Node.js.
Any help is appreciated!
Thanks,
Henri
I think the best approach is:
1. Increment/decrement a counter in your database to keep track of the number of likes.
2. Insert each like as a single document in a collection called 'like', where you track the id of the user who likes the story and the id of the liked story.
Then, if you just need the number of likes, you can read the counter, which is really fast; if instead you need to know who the likes came from, you query the 'like' collection by story id and get the ids of all users who liked the story.
The documents I am talking about in the 'like' collection will look like this:
{
  _id: 'dfggsdjtsdgrhtd',
  story_id: 'ertyerdtyfret',
  user_id: 'sdrtyurertyuwert'
}
You can store the counter in the story's document itself:
{
  ...
  likes: 56
}
You can also keep track of the most recent likes in the story's document itself (for example the last 1,000; only the last ones, because MongoDB documents are limited to 16 MB, and if your application scales you will have trouble storing potentially unlimited data in a single document). With this approach you can show the latest likes without even querying the 'like' collection.
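A rough sketch of such a capped embedded list (names and ids are illustrative):

from pymongo import MongoClient

db = MongoClient()["app"]  # assumed database name
story_id, user_id = 'ertyerdtyfret', 'sdrtyurertyuwert'  # illustrative ids

# Append the new like and keep only the most recent 1000 entries.
db.stories.update_one(
    {"_id": story_id},
    {"$push": {"last_likes": {"$each": [{"user_id": user_id}], "$slice": -1000}}},
)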
When someone unlikes a story you can simply remove the like document from the 'like' collection, or, as a better approach (e.g. if you send a notification when someone's story is liked), just record in that document that it was unliked, so that if the same user likes it again you can see the like was already inserted and avoid sending another notification.
Example:
First-time insert:
{
  _id: 'dfggsdjtsdgrhtd',
  story_id: 'ertyerdtyfret',
  user_id: 'sdrtyurertyuwert',
  active: true
}
When unliked, update to this:
{
  _id: 'dfggsdjtsdgrhtd',
  story_id: 'ertyerdtyfret',
  user_id: 'sdrtyurertyuwert',
  active: false
}
When each like is added, check whether there's an existing document with the same story id and user id. If there is and active is false, the user has already liked and unliked the story, so when they like it again you won't send an already-sent notification.
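A rough pymongo sketch of this like/unlike flow (database and collection names, and the send_notification helper, are illustrative):

from pymongo import MongoClient, ReturnDocument

db = MongoClient()["app"]  # assumed database name

def like(story_id, user_id):
    # Re-activate (or create) the like document for this user/story pair.
    previous = db.likes.find_one_and_update(
        {"story_id": story_id, "user_id": user_id},
        {"$set": {"active": True}},
        upsert=True,
        return_document=ReturnDocument.BEFORE,  # None if newly created
    )
    if previous is None or not previous.get("active"):
        # Only bump the counter if the like wasn't already active.
        db.stories.update_one({"_id": story_id}, {"$inc": {"likes": 1}})
    if previous is None:
        send_notification(story_id)  # hypothetical helper; first like only

def unlike(story_id, user_id):
    result = db.likes.update_one(
        {"story_id": story_id, "user_id": user_id, "active": True},
        {"$set": {"active": False}},
    )
    if result.modified_count:
        db.stories.update_one({"_id": story_id}, {"$inc": {"likes": -1}})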

Mongo Schema for Quiz Site

I'm building a small Node/Mongo app that serves users with up to 3 questions per day. Users can only answer yes or no and the correct answer will be determined at a later time (these questions are closer to predictions). Currently, I have these documents:
User
  id
Question
  id
QuestionAnswer
  id
  question_id (ref)
UserAnswer
  id
  question_id (ref)
  user_id (ref)
What is the most efficient way to query the db so I get today's questions but also check whether the user has already answered each one? I feel like I'm overthinking it. I've tried a couple of ways that seem to be overkill.
It's good to put them all in one schema, since we don't have joins in MongoDB.
Embedding is faster than using references.
Also, to keep your query small, take a look at this.
You should stay away from relations until you have a good reason for using them. So what you need is only one schema, as in the sketch below.
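A rough pymongo sketch of that single-schema approach (database, collection, and field names are illustrative): each question embeds its user answers, so one aggregation returns today's questions plus an already_answered flag for the current user.

import datetime
from pymongo import MongoClient

db = MongoClient()["quiz"]  # assumed database name

# Assumed question document shape:
# {"_id": ..., "date": <midnight of the serving day>,
#  "text": ..., "correct_answer": None,           # filled in later
#  "answers": [{"user_id": ..., "answer": True}, ...]}

def todays_questions(user_id):
    today = datetime.datetime.combine(datetime.date.today(), datetime.time.min)
    return list(db.questions.aggregate([
        {"$match": {"date": today}},
        {"$limit": 3},
        # Flag whether this user appears among the embedded answers.
        {"$addFields": {
            "already_answered": {
                "$in": [user_id, {"$ifNull": ["$answers.user_id", []]}],
            },
        }},
        {"$project": {"answers": 0}},  # don't ship everyone's answers
    ]))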
