How would one implement like and dislike counts for a post in CouchDB/Couchbase? [closed] - couchdb

How does one implement live like and dislike counts [or, say, view counts] in CouchDB/Couchbase in the most efficient way?
Yes, one can use a reduce to calculate the count each time, and on the front end use a single API call to increment or decrement and fetch the result.
But every post may accumulate, say, millions of views, likes, and dislikes.
If we have millions of such posts [in a social networking site], the index will simply be too big.

In terms of Cloudant, the described use case requires a bit of care:
Fast writes
Ever-growing data set
Potentially global queries with aggregations
The key here is to use an immutable data model--don't update any existing documents, only create new ones. This means that you won't have to suffer update conflicts as the load increases.
So a post is its own document in one database, and the likes are stored separately. For likes, you have a few options. The classic CouchDB solution would be to have a separate database with "likes" documents containing the post id of the post they refer to, with a view emitting the post id, aggregated by the built-in _count. This would be a pretty efficient solution in this case, but yes, indexes do occupy space on Couch-like databases (just as with any other database).
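As a rough sketch of that classic approach (assuming a separate "likes" database where each like document carries a postId field; database, view, and field names here are just illustrative, and auth is omitted):

// One-time setup: a design document whose view emits the liked post's id,
// aggregated with the built-in _count reduce (CouchDB HTTP API, Node 18+ fetch).
const COUCH = "http://localhost:5984";   // assumed CouchDB endpoint
const DB = "likes";                      // assumed database name

async function createLikesView(): Promise<void> {
  await fetch(`${COUCH}/${DB}/_design/likes`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      views: {
        by_post: {
          map: "function (doc) { emit(doc.postId, null); }",
          reduce: "_count",
        },
      },
    }),
  });
}

// Reading the like count for one post is then a single reduced view lookup.
async function likeCount(postId: string): Promise<number> {
  const key = encodeURIComponent(JSON.stringify(postId));
  const res = await fetch(`${COUCH}/${DB}/_design/likes/_view/by_post?key=${key}`);
  const body = await res.json();
  return body.rows.length ? body.rows[0].value : 0;
}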
Second option would be to exploit the _id field, as this is an index you get for free. If you prefix the like-documents' ids with the liked post's id, you can do an _all_docs query with a start and end key to get all the likes for that post.
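A hedged sketch of that second option, assuming like documents are written with _id values of the form "<postId>:<something unique>":

// The free _id index answers "all likes for post X" as a range query on _all_docs.
async function likesForPost(postId: string): Promise<string[]> {
  const params = new URLSearchParams({
    startkey: JSON.stringify(`${postId}:`),
    endkey: JSON.stringify(`${postId}:\ufff0`), // high unicode sentinel closes the range
  });
  const res = await fetch(`http://localhost:5984/likes/_all_docs?${params}`);
  const body = await res.json();
  return body.rows.map((r: { id: string }) => r.id);
}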
Third - recent CouchDB versions and Cloudant have the concept of partitioned databases, which very loosely speaking can be viewed as a formalised version of option two above: you nominate a partition key which is used to ensure a degree of storage locality behind the scenes -- all documents within the same partition are stored in the same shard. This means that it's faster to retrieve -- and on Cloudant, also cheaper. In your case you'd create a partitioned "likes" database with the partition key being the post id. Glynn Bird wrote up a great intro to partitioned DBs here.
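A minimal sketch of the partitioned variant (CouchDB 3.x-style HTTP API; like documents are assumed to use ids of the form "<postId>:<unique>", so the post id is the partition):

// Create the database as partitioned, then query a single partition's docs.
async function createPartitionedLikesDb(): Promise<void> {
  await fetch("http://localhost:5984/likes?partitioned=true", { method: "PUT" });
}

async function partitionLikeCount(postId: string): Promise<number> {
  // _all_docs scoped to one partition: only that partition's shard is touched.
  const res = await fetch(
    `http://localhost:5984/likes/_partition/${encodeURIComponent(postId)}/_all_docs`
  );
  const body = await res.json();
  return body.rows.length; // every doc in the partition is one like
}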
Your remaining issue is that of ever-growth. At Cloudant, we'd expect to get to know you well once your data volume goes beyond single digit TBs. If you'd expect to reach this kind of volume, it's worth tackling that up-front. Any of the likes schemes above could most likely be time-boxed and aggregated once a quarter/month/week or whatever suits your model.

Related

rdb vs key-value store for django functionality [duplicate]

When would one choose a key-value data store over a relational DB? What considerations go into deciding one or the other? When is mix of both the best route? Please provide examples if you can.
Key-value, hierarchical, map-reduce, or graph database systems are much closer to implementation strategies; they are heavily tied to the physical representation. The primary reason to choose one of these is if there is a compelling performance argument and it fits your data processing strategy very closely. Beware, ad-hoc queries are usually not practical for these systems, and you're better off deciding on your queries ahead of time.
Relational database systems try to separate the logical, business-oriented model from the underlying physical representation and processing strategies. This separation is imperfect, but still quite good. Relational systems are great for handling facts and extracting reliable information from collections of facts. Relational systems are also great at ad-hoc queries, which the other systems are notoriously bad at. That's a great fit in the business world and many other places. That's why relational systems are so prevalent.
If it's a business application, a relational system is almost always the answer. For other kinds of applications, it's probably still the answer. If you have more of a data processing problem, like some pipeline of things that need to happen, you have massive amounts of data, and you know all of your queries up front, another system may be right for you.
If your data is simply a list of things and you can derive a unique identifier for each item, then a KVS is a good match. They are close implementations of the simple data structures we learned in freshman computer science and do not allow for complex relationships.
A simple test: can you represent your data and all of its relationships as a linked list or hash table? If yes, a KVS may work. If no, you need an RDB.
You still need to find a KVS that will work in your environment. Support for KVSes, even the major ones, is nowhere near what it is for, say, PostgreSQL and MySQL/MariaDB.
IMO, key-value pair stores (e.g. NoSQL databases) work best when the underlying data is unstructured, unpredictable, or changing often. If you don't have structured data, a relational database is going to be more trouble than it's worth, because you will need to make lots of schema changes and/or jump through hoops to conform your data to the structure.
KVP / JSON / NoSQL is great because changes to the data structure do not require completely refactoring the data model. Adding a field to your data object is simply a matter of adding it to the data. The other side of the coin is that there are fewer constraints and validation checks in a KVP / NoSQL database than in a relational database, so your data might get messy.
There are performance and space saving benefits for relational data models. Normalized relational data can make understanding and validating the data easier because there are table key relationships and constraints to help you out.
One of the worst patterns I've seen is trying to have it both ways. Trying to put a key-value pair into a relational database is often a recipe for disaster. I would recommend using the technology that suits your data foremost.
If you want O(1) lookups of values based on keys, then you want a KV store. Meaning, if you have data of the form k1={foo}, k2={bar}, etc, even when the values are larger/ nested structures, and want fast lookups, you want a KV store.
Even with proper indexing, you cannot achieve O(1) lookups in a relational DB for arbitrary keys. Sometimes this is referred to as "random lookups".
Alternatively stated, if you only ever query by one column, a "primary key" if you will, to retrieve the rest of the data, then using that column as a keyspace and the rest of the data as a value in a KV store is the most efficient way to do lookups.
In contrast, if you often query the data by any of several columns, aka you support a richer query API for the data, then you may want a relational database.
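As a toy illustration of that distinction (a pure in-memory sketch, not tied to any particular store; the types and data are invented):

// Key lookups are O(1) against a hash-map-shaped store, but answering
// "which posts does user 123 own?" needs either a full scan or a second
// index you maintain yourself -- the kind of query a relational database
// gives you for free via secondary indexes.
type Post = { id: string; ownerId: number; body: string };

const postsByKey = new Map<string, Post>();   // the "KV store"
postsByKey.set("id1", { id: "id1", ownerId: 123, body: "hello" });

// O(1): query by the primary key.
const post = postsByKey.get("id1");

// O(n): query by any other column means walking every value.
const byOwner = [...postsByKey.values()].filter(p => p.ownerId === 123);
console.log(post?.body, byOwner.length);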
A traditional relational database has problems scaling beyond a point. Where that point is depends a bit on what you are trying to do.
All (or at least most) of the cloud computing providers offer key-value data stores.
However, if you have a reasonably sized application with a complicated data structure, then the support that you get from using a relational database can reduce your development costs.
In my experience, if you're even asking the question whether to use traditional vs esoteric practices, then go traditional. While esoteric practices are sexy, challenging, and fun, 99.999% of applications call for a traditional approach.
With regards to relational vs KV, the question you should be asking is:
Why would I not want to use a relational model for this scenario: ...
Since you have not described the scenario, it's impossible for anyone to tell you why you shouldn't use it. The "catch all" reason for KV is scalability, which isn't a problem now. Do you know the rules of optimization?
Don't do it.
(for experts only) Don't do it now.
KV is a highly optimized solution to scalability that will most likely be completely unnecessary for your application.

How to structure relationships in Azure Cosmos DB?

I have two sets of data in the same collection in Cosmos: one is 'posts' and the other is 'users', and they are linked by the posts users create.
Currently my structure is as follows:
// user document
{
id: 123,
postIds: ['id1','id2']
}
// post document
{
id: 'id1',
ownerId: 123
}
{
id: 'id2',
ownerId: 123
}
My main issue with this setup is how fragile it is: code has to enforce the link, and if there's a bug, data will very easily be lost with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As said by David, it's a long discussion, but it is a very common one, so, since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity) which is something that is needed when you decompose a bigger object into its constituent pieces. Also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational databases, since it helps a lot to avoid data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen to use a JSON document database, you should leverage the fact that it's able to store the entire document, and then just store the document ALONG WITH all the owner data: name, surname, or all the other data you have about the user who created the document. Yes, I'm saying that you may want to evaluate not having posts and users, but just posts, with the user info inside each one. This may actually be very correct, as you will be sure to get the EXACT data for the user existing at the moment of post creation. Say for example I create a post and I have biography "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and this is just right, as they have exactly captured reality.
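For example, a denormalized post document along those lines might look like this (the extra fields are invented purely for illustration):

// post document with the author's data captured at creation time
{
  id: 'id1',
  title: 'My first post',
  author: {
    id: 123,
    name: 'Jane Doe',
    biography: 'X'   // the biography as it was when this post was written
  }
}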
Of course you may also want to display a biography on an author page. In this case you'll have a problem: which one will you use? Probably the last one.
If all authors, in order to exist in your system, MUST have a blog post published, that may well be enough. But maybe you want to have an author write their biography and be listed in your system even before they write a blog post.
In such a case you need to NORMALIZE the model and create a new document type, just for authors. If this is your case, then you also need to figure out how to handle the situation described before. When an author updates their own biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let's assume that you really want to have posts and users kept in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (and NoSQL databases in general) DOES NOT OFFER any kind of native support to enforce referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property, so that before deleting an author, for example, you can efficiently check if there are any blog posts by that author that would otherwise be left orphaned.
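A hedged sketch of that orphan check with the @azure/cosmos SDK (the database, container, and property names are just the ones from the question; endpoint and key are placeholders):

import { CosmosClient } from "@azure/cosmos";

// Before deleting an author, count the posts that still reference them.
// ownerId is queryable because Cosmos DB indexes all properties by default.
async function authorHasPosts(authorId: number): Promise<boolean> {
  const client = new CosmosClient({
    endpoint: "https://<account>.documents.azure.com",
    key: "<key>",
  });
  const container = client.database("blog").container("documents");
  const { resources } = await container.items
    .query<number>({
      query: "SELECT VALUE COUNT(1) FROM c WHERE c.ownerId = @authorId",
      parameters: [{ name: "@authorId", value: authorId }],
    })
    .fetchAll();
  return (resources[0] ?? 0) > 0;
}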
Another option is to manually create and keep updated ANOTHER document that, for each author, keeps track of the blog posts he/she has written. With this approach you can just look at this document to understand which blog posts belong to an author. You can try to keep this document automatically updated using triggers, or do it in your application. Just keep in mind that when you normalize, in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
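Such a manually maintained lookup document could be as simple as this (the shape is just an illustration, mirroring the ids from the question):

// one document per author, listing the posts they have written
{
  id: 'author-123-posts',
  authorId: 123,
  postIds: ['id1', 'id2']
}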
PERFORMANCE
Performance COULD be an issue, but you don't usually model in order to support performance in the first place. You model in order to make sure your model can represent and store the information you need from the real world, and then you optimize it in order to have decent performance with the database you have chosen to use. As different databases will have different constraints, the model will then be adapted to deal with those constraints. This is nothing more and nothing less than the good old "logical" vs "physical" modeling discussion.
In Cosmos DB case, you should not have queries that go cross-partition as they are more expensive.
Unfortunately, partitioning is something you choose once and for all, so you really need to be clear in your mind about the most common use cases you want to support best. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will only be one if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting your performance A LOT. Remember, in fact, that Cosmos DB RUs are split among all the available partitions: with 10,000 RUs, for example, you usually get 5 partitions, which means that all your values will be spread across 5 partitions. Each partition will have a top limit of 2,000 RUs. If all your queries use just one partition, your real maximum performance is that 2,000 and not 10,000 RUs.
I really hope this helps you start to figure out the answer. And I really hope it helps to foster and grow a discussion (how to model for a document database) that I think is really due and mature now.

Can Druid replace Cassandra? [closed]

I can't help but think that there aren't many use cases that can be served more effectively by Cassandra than by Druid. As a time series store or key-value store, queries can be written in Druid to extract data however needed.
The argument here is more around justifying Druid than Cassandra.
Apart from the fast writes in Cassandra, is there really anything else? Especially given the real-time aggregation and querying capabilities of Druid, do they not outweigh Cassandra?
For a more straight question that can be answered: doesn't Druid provide a superset of features compared to Cassandra, and wouldn't one be better off using Druid right away? For all use cases?
Not at all, they are not comparable. We are talking about two very different technologies here. An easy way is to see Cassandra as a distributed storage solution, but Druid as a distributed aggregator (i.e. an awesome open-source OLAP-like tool (: ). The post you are referring to, in my opinion, is a bit misleading in the sense that it compares the two projects in the world of data mining, which is not Cassandra's focus.
Druid is not good at point lookups, at all. It loves time series, and its partitioning is mainly based on date-based segments (e.g. hourly/monthly etc. segments that may be further sharded based on size).
Druid pre-aggregates your data based on pre-defined aggregators -- which are numbers (e.g. sum the number of click events on your website with a daily granularity, etc.). If you want to store a key lookup from a string to, say, another string or an exact number, Druid is the worst solution you could look for.
Not sure this is really a SO type of question, but the easy answer is that it's a matter of use case. Simply put, Druid shines when it facilitates very fast ad-hoc queries to data that has been ingested in real time. It's read consistent now and you are not limited by pre-computed queries to get speed. On the other hand, you can't write to the data it holds, you can only overwrite.
Cassandra (from what I've read; haven't used it) is more of an eventually consistent data store that supports writes and does very nicely with pre-compute. It's not intended to continuously ingest data while providing real-time access to ad-hoc queries to that same data.
In fact, the two could work together, as has been proposed on planetcassandra.org in "Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine!".
It depends on the use case. For example, I was using Cassandra for aggregation purposes, i.e. stats like the aggregated number of domains w.r.t. users, departments, etc., and event trends (bandwidth, users, apps, etc.) with configurable time windows. Replacing Cassandra with Druid worked out very well for me because Druid is super efficient with aggregations. On the other hand, if you need time series data with eventual consistency, where you can get the details of the events, Cassandra is better.
The combination of Druid and Elasticsearch worked out very well to remove Cassandra from our Big Data infrastructure.

windows azure table storage for big load [closed]

What load can Azure Table Storage handle at most (one account)? For example, can it handle 2,000 reads/sec where the response must come in less than a second (requests would be made from many different machines and the payload of one entity is something like 500 KB on average)? What are the practices to accommodate such load (how many tables and partitions, given that there is only one type of entity and in principle there could be any number of tables/partitions)? Also, the RowKeys are uniformly distributed 32-character hash strings and the PartitionKeys are also uniformly distributed.
Check the Azure Storage Scalability and Performance Targets documentation page. That should answer part of your question.
http://msdn.microsoft.com/en-us/library/azure/dn249410.aspx
I would suggest reading the best practices here: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
The following are the scalability targets for a single storage account:
• Transactions – up to 5,000 entities/messages/blobs per second
Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and most tables have many partitions. The throughput target for a single partition is:
◦ Up to 500 entities per second
◦ Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning can process up to a few thousand requests per second (up to the storage account target).
As long as you correctly partition your data so you don't have a bunch of data all going to one machine, one table should be fine. Also keep in mind how you will query the data: if you don't use the index (PartitionKey + RowKey), it will have to do a full table scan, which is very expensive with a large dataset.
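A small sketch of that difference with the @azure/data-tables SDK (account, table, keys, and the filter property are assumptions for illustration, not from the question):

import { TableClient, AzureNamedKeyCredential } from "@azure/data-tables";

async function demo(): Promise<void> {
  const credential = new AzureNamedKeyCredential("<account>", "<account-key>");
  const client = new TableClient(
    "https://<account>.table.core.windows.net",
    "entities",
    credential
  );

  // Point query: PartitionKey + RowKey hits the index directly -- cheap and fast.
  const entity = await client.getEntity("<partition-key>", "<row-key>");
  console.log(entity.rowKey);

  // Filtering on a non-key property forces a scan -- expensive on a large table,
  // so design PartitionKey/RowKey around your read patterns.
  for await (const e of client.listEntities({
    queryOptions: { filter: "Color eq 'blue'" },
  })) {
    console.log(e.rowKey);
  }
}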

How to Structure Mongo Database for REST API [closed]

I'm writing a REST API with Mongo and was intrigued by the whole document modeling strategy. It seems like a very divisive issue where people say to denormalize first, then normalize or vice-versa.
I'm interested to see how the resource structure of a REST api influences the structure of a document-based db. It seems like with a REST api resource structure, it almost makes sense to have separate collections for everything (i.e. locations, tenants, transactions) although this seems like it would be working against one of Mongo's benefits.
My question is how would you model the resources of a REST api in a NoSQL (specifically Mongo) document database.
The answer is, there are many ways, depending on what you want to optimize on.
Generally, the defining of your document schemas and separation of collections will depend on what your specific use cases for the documents are - how will you consume your data?
One big concept to remember, is that "Joins" between collections are costly - basically you're getting a foreign key from one collection and doing a whole other lookup in another collection, which is why de-normalization generally helps performance - if it matches your use cases. This is where MongoDB has the potential to shine, although in the future if your requirements change, your document structures could potentially need to change dramatically.
A second key consideration, is the MongoDB document size limit - roughly 16MB last time I checked.
Take your classic blog website example, with a blog posts collection. We can choose to store comments as subdocuments, as an array in the post document. This way, you could have a REST API for /posts/postID returning you the post document without having to do any "joins" or lookups in other collections for comments and so forth. But then you run into problems if you have posts with humongous amounts of comments on them, so in that case you would have to normalize your data by separating out comments into another collection.
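To make that concrete, the embedded version might look like this (fields are purely illustrative):

// posts collection: comments embedded as an array in the post document
{
  _id: 'post123',
  title: 'Intro to document modeling',
  comments: [
    { author: 'alice', text: 'Nice post!' },
    { author: 'bob', text: 'Thanks for sharing.' }
  ]
}
// normalized alternative: a separate comments collection where each comment
// holds a postId, queried with e.g. db.comments.find({ postId: 'post123' })
// once a post outgrows the embedded array (or the 16MB document limit).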
So, speed / ease of retrieval from the database and the flexibility of your document storage - should you need to change a document's schema structure for the future, are two major considerations you should think about as you plan a project API out.
Ask yourself: how is document/collection X going to be used? When would you need to retrieve data from it? If one resource, tenants, has a "parent resource", location, and accessing location is the only time you actually need tenants, then by all means you could potentially design the storage of tenants into the schema of location. But if you need to be able to query tenants by themselves, then you probably want to break tenants out into their own collection. So there's no right or wrong way to go about it; just base your planning on how you plan to consume your data!
Good luck!

Resources