Is it preferable to have high-activity fields in Cassandra exist in their own table?

Let's say I'm implementing a forum system (let's think something like Reddit or even SO) that's backed by Cassandra.
A post has multiple fields, like content, timestamp, etc., plus a rating (upvotes minus downvotes). Posts are backed by a POSTS table. Let's assume, for the sake of argument, that I don't care which specific users upvoted or downvoted; I just care about a post's total rating.
I'm wondering if there's any advantage in storing the ratings in a RATINGS (post_id, rating) table instead of just having it as a field in POSTS, given that there are going to be lots of upvotes / downvotes happening all the time.
Given Cassandra's architecture, what would be the ins and outs of choosing one approach over the other?

Putting the rating in another table does not make sense, since it appears you would be using the same partition key for both tables (POSTS and RATINGS). You can always read the rating from the POSTS table, so I don't see any benefit in creating a RATINGS table.
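One caveat worth adding, since votes are high-churn: if you keep the rating as a regular column in POSTS, the application has to do a read-modify-write on every vote. Cassandra's counter type avoids that, but counter columns cannot be mixed with regular columns in one table, so choosing counters would force a separate ratings table after all. A minimal CQL sketch (table and column names are illustrative):

```sql
-- single-table approach: rating is a plain int,
-- updated by the application (read-modify-write)
create table posts (
    post_id uuid primary key,
    content text,
    created_at timestamp,
    rating int
);

-- counter approach: a counter table may contain only primary key
-- and counter columns, so it must be a separate table
create table post_ratings (
    post_id uuid primary key,
    rating counter
);

-- atomic server-side increment, no read needed
update post_ratings set rating = rating + 1 where post_id = ?;
```

So the real deciding factor is less "same partition key" and more whether you want counter semantics for the votes.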


Solr sort and limit the results of a sub-query

I am using Solr as my search engine, and what I want to do is sort and limit the results of a sub-query. For example, let's say I have an Amazon product review dataset and I want to get all products whose title contains "iphone" OR products in the smart-phone category.
I'd write a Solr query something like "name:iphone OR category:smartphone". However, the problem with this is that there are too many products in the "smartphone" category. So I want to limit it to only popular products, where popularity is defined by something like a reviewCount. For the second sub-query, I'd like to sort its results by reviewCount and take only the top K. That is, I want something like:
name:iphone OR (category:smartphone AND sort:reviewCount desc AND rows=100)
So that I can get the products that match "iphone" OR the top 100 most popular smartphones.
Does Solr support something like this?
I'm sorry to tell you that this is not possible. Lucene-based search engines spread their indexes over multiple shards. Every shard then calculates matches and scores independently; at the very end, the results are aggregated and the list of result rows is cropped. That's why sub-queries do not exist here. You can only boost the score (which should be preferred over sorting anyway) or make two parallel requests and combine the results on the client side (which should be fairly easy in your example).
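For the parallel-request approach, the two queries could look like this (field names follow the question; the rows values are illustrative):

```
# request 1: title matches
q=name:iphone&rows=1000

# request 2: top 100 smartphones by review count
q=category:smartphone&sort=reviewCount desc&rows=100
```

Then merge the two result lists on the client, de-duplicating by document id.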

How to structure relationships in Azure Cosmos DB?

I have two sets of data in the same collection in Cosmos; one is 'posts' and the other is 'users'. They are linked by the posts users create.
Currently my structure is as follows:
// user document
{
id: 123,
postIds: ['id1','id2']
}
// post document
{
id: 'id1',
ownerId: 123
}
{
id: 'id2',
ownerId: 123
}
My main issue with this setup is its fragility: code has to enforce the link, and if there's a bug, data can easily be lost with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As David said, it's a long discussion, but it is a very common one, so, since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity), which is something you need when you decompose a bigger object into its constituent pieces. This is also called normalization.
While this is normally done in a relational database, it is now becoming popular in non-relational databases too, since it helps a lot to avoid data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen a JSON document database, you should leverage the fact that it can store the entire document, and just store the post ALONG WITH all the owner data: name, surname, and whatever else you have about the user who created it. Yes, I'm saying that you may want to evaluate not having posts and users, but just posts, with the user info embedded. This may actually be very correct, as you will be sure to get the EXACT data for the user as it existed at the moment of post creation. Say, for example, I create a post while I have biography "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and that is just right, as they capture reality exactly.
Of course, you may also want to display a biography on an author page. In that case you'll have a problem: which one will you use? Probably the latest one.
If all authors, in order to exist in your system, MUST have a blog post published, that may well be enough. But maybe you want an author to be able to write his biography and be listed in your system even before he writes a blog post.
In that case you need to NORMALIZE the model and create a new document type just for authors. If this is your case, then you also need to figure out how to handle the situation described before. When an author updates his biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
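For example, a fully denormalized post document (field names here are illustrative) would embed the author data exactly as it was at creation time:

```json
{
  "id": "id1",
  "title": "My first post",
  "author": {
    "id": 123,
    "name": "Jane",
    "biography": "X"
  }
}
```

With this shape there is no link to enforce and no lookup to resolve; the trade-off is that author updates do not propagate to existing posts.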
CONSISTENCY
Let's assume that you really want to keep posts and users in separate documents, and thus normalize your model. In this case, keep in mind that Cosmos DB (and NoSQL databases in general) DOES NOT OFFER any kind of native support to enforce referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property so that, before deleting an author, for example, you can efficiently check whether there are any blog posts by him/her that would otherwise be left orphaned.
Another option is to manually create and keep up to date ANOTHER document that, for each author, tracks the blog posts he/she has written. With this approach you can just look at this document to see which blog posts belong to an author. You can try to keep it updated automatically using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
PERFORMANCE
Performance COULD be an issue, but you don't usually model for performance in the first place. You model to make sure your model can represent and store the information you need from the real world, and then you optimize it to get decent performance from the database you have chosen to use. As different databases have different constraints, the model is then adapted to deal with those constraints. This is nothing more and nothing less than the good old "logical" vs "physical" modeling discussion.
In Cosmos DB's case, you should avoid queries that go cross-partition, as they are more expensive.
Unfortunately, partitioning is something you choose once and for all, so you really need to be clear about the most common use cases you want to support best. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will be one only if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting your performance A LOT. Remember, in fact, that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means that your values will be spread across 5 partitions, each with a top limit of 2,000 RU. If all your queries hit just one partition, your real maximum throughput is that 2,000 RU, not 10,000.
I really hope this helps you start to figure out the answer. And I really hope it helps to foster and grow a discussion (how to model for a document database) that I think is really due and mature now.

Is the twissandra data model an efficient one?

Help me please,
I am new to the Cassandra world, so I need some advice.
I am trying to design a data model for a Cassandra DB.
In my project I have:
- users, who can follow each other,
- articles, which can be related to many topics.
Each user can also follow many topics.
So the goal is to build an aggregated feed where a user will get:
articles from all topics he follows +
articles from all friends he follows +
his own articles.
I have searched for similar tasks and found the twissandra example project.
As I understand it, in that example we store only the IDs of tweets in the timeline; when we need the timeline, we get the tweet IDs and then fetch each tweet by ID in a separate non-blocking request. After collecting all the tweets, we return the list of tweets to the user.
So my question is: is this efficient?
Making ~41 requests to the DB to get one page of tweets?
And my second question is about followers.
When someone creates a tweet, we get all of his followers and put the tweet ID into their timelines,
but what if a user has thousands of followers?
It means that to create just one tweet we have to write (1 + followers_count) times to the DB?
twissandra is more of a toy example. It will work for some workloads, but you may need to partition the data further (break up huge rows).
Essentially, though, yes, it is fairly efficient. It can be made more so by including the content in the timeline, but depending on requirements that may be a bad idea (if you need deleting/editing). The writes should be a non-issue; 20k writes/sec/node is reasonable, provided you have adequate systems.
If I understand your use case correctly, you will probably be fine with a twissandra-like schema, but be sure to test it with expected workloads. Keep in mind that at a certain scale everything gets a little more complicated (e.g. if you expect millions of articles you will need further partitioning; see https://academy.datastax.com/demos/getting-started-time-series-data-modeling).
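A twissandra-like timeline table, sketched in CQL (table and column names are illustrative), stores one partition per user with tweet IDs clustered by time:

```sql
create table timeline (
    user_id uuid,
    tweet_time timeuuid,
    tweet_id uuid,
    primary key (user_id, tweet_time)
) with clustering order by (tweet_time desc);

-- one page of the feed (then fetch each tweet by id,
-- which is where the ~41 requests come from):
select tweet_id from timeline where user_id = ? limit 40;
```

Denormalizing the tweet content into this table would collapse the page load to one query, at the cost of making edits and deletes fan out to every follower's timeline.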

How to auto-replicate data in Cassandra

I am very new to Cassandra and currently in the early stage of a project where I am studying it.
Now, since Cassandra says to de-normalize and duplicate data, I have the following scenario:
I have a table, user_master, for users. A user has:
subject [type text]
hobbies [type list]
uid [type int]
around 40 more attributes
Now, a user wants to search for other users. This search should find all users who match the subject and hobbies the searching user provides. For this reason I am planning to make a separate table, user_discovery, which for every user will have only the following attributes:
subject [type text]
hobbies [type list]
uid [type int]
*other irrelevant attributes won't be part of this table.
Now my question is:
Do I need to write to both tables for every insert/update in user_master? Can updating user_discovery be automated whenever there is an insert/update in user_master?
Even after studying a bit, I am still not so sure that a separate table would improve performance, since the number of users would be the same in both tables (yes, the number of columns would be much smaller in user_discovery). Any comment on this would be highly appreciated.
Thanks
The idea of separate tables for queries is to have the key of the table contain what you are looking for.
You don't say what the key of your second table looks like, but your wording "the following attributes for every user" suggests you plan to have the user (id?) as the key. That would indeed give no performance advantage.
If you want to find users by their hobby, make a table with the hobby as the key and the user ids (or whatever you use to look up users) as columns. Write one row per hobby, listing all users who have that hobby, adding each user to every row that matches one of his hobbies.
Do the same for the subject (i.e. a separate table, subject as key, user ids as columns).
Then, if you want to find a user having a list of specific hobbies, make one query per hobby and take the intersection of the users.
To use these kinds of lookup tables you would indeed have to update every table each time you update a user.
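Sketched in CQL (table and column names are illustrative), such a lookup table could look like:

```sql
create table users_by_hobby (
    hobby text,
    uid int,
    primary key (hobby, uid)
);

-- one query per hobby; intersect the resulting uid sets
-- on the client side
select uid from users_by_hobby where hobby = ?;
```

Each hobby is one partition, so a single query answers "who has this hobby" from one node's worth of data.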
Disclaimer: I used this kind of approach rather successfully in a relatively complex setting managing a few hundred thousand users. However, that was two years ago, on a Cassandra 1.5 system. I haven't really looked into the new features of Cassandra 2.0, so I have no idea whether a more elegant approach would be possible today.
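In newer Cassandra versions (3.0+), materialized views can keep this kind of derived table in sync automatically, though they carry their own restrictions and operational caveats. A sketch, assuming uid is the primary key of user_master (a view's key must include all base key columns plus at most one other, and a collection such as hobbies cannot be part of it):

```sql
create materialized view user_discovery as
    select subject, uid
    from user_master
    where subject is not null and uid is not null
    primary key (subject, uid);
```

Cassandra then maintains user_discovery on every insert/update to user_master, with no application-side double write.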

What is the correct data model for storing user relationships in Cassandra (e.g. Bob follows John)?

I have a system where actions of users need to be sent to other users who subscribe to those updates. There aren't a lot of users/subscribers at the moment, but it could grow rapidly so I want to make sure I get it right. Is it just this simple?
create table subscriptions (
    person_uuid uuid,
    subscribes_person_uuid uuid,
    primary key (person_uuid, subscribes_person_uuid)
)
I need to be able to look up things in both directions, i.e. answer the questions:
Who are Bob's subscribers?
Who does Bob subscribe to?
Any ideas, feedback, suggestions would be useful.
Those two queries represent the start of your model:
you want the user to be the PK or part of the PK.
Depending on the cardinality of subscriptions/subscribers you could go with:
for low numbers: a single table and two sets
for high numbers: two tables similar to the one you describe
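The two-table variant could be sketched like this (table names are illustrative), writing each relationship twice so that both directions are single-partition reads:

```sql
-- who subscribes to this person?
create table subscribers_of (
    person_uuid uuid,
    subscriber_uuid uuid,
    primary key (person_uuid, subscriber_uuid)
);

-- who does this person subscribe to?
create table subscribed_to (
    person_uuid uuid,
    subscribes_person_uuid uuid,
    primary key (person_uuid, subscribes_person_uuid)
);
```

On every follow, insert into both tables (ideally in a logged batch) so either question is answered by one partition lookup.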
@Jacob
Your use case looks very similar to the Twitter example; I modeled it here.
If you want to track both sides of the relationship, you'll need a dedicated table to index them.
Last but not least, depending on whether the users are mutable or not, you can decide to denormalize (e.g. duplicate the user content) or just store user ids and then fetch the user content from a separate table.
I've implemented a simple join feature in Achilles. Have a look if you want to go this way.
