Cassandra data modeling with denormalization

I read about Cassandra data modeling, and everything is clear except that the denormalized data may change. How do I keep it in sync?
For example, what is the way to update the email when a user's email changes, given this table:
CREATE TABLE groups (
    groupname text,
    username text,
    email text,
    age int,
    hash_prefix int,
    PRIMARY KEY ((groupname, hash_prefix), username)
);
groupname is part of the groups table, but the user in my data model may not know which groups they belong to, so there is no way to update the email there after it changes.
Is the solution described below appropriate?
Add a groups column (type set<text>) to the user model.
If the user model has the primary key username, then I can add a DAOperUser(username) class with updateName and addGroup methods to the application.
For every username, instantiate its own object (through a factory), which reads its state from the users table on initialization. That way it holds both the username and the groups, so a change can be issued as a batched write to both tables (users and groups).

When inserting or updating data, you need to use BATCH statements to keep the data in sync between the two tables users and groups.
For example:
BEGIN BATCH
INSERT INTO users (...) VALUES (...);
INSERT INTO groups (...) VALUES (...);
APPLY BATCH;
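To make that concrete for the schema in this question, a sketch of propagating an email change might look like the following; the users table layout, the hash_prefix value, and the literals are assumptions taken from the question's proposal, not a fixed recipe:
BEGIN BATCH
  -- users table as proposed in the question: PRIMARY KEY (username)
  UPDATE users SET email = 'new@example.com' WHERE username = 'jdoe';
  -- one statement per entry in the user's groups set; the app must
  -- recompute hash_prefix the same way it was computed on insert
  UPDATE groups SET email = 'new@example.com'
    WHERE groupname = 'devs' AND hash_prefix = 3 AND username = 'jdoe';
APPLY BATCH;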
If you're interested, there's a free tutorial at https://www.datastax.com/dev that explains the concepts in detail, with hands-on exercises on a pre-installed Cassandra cluster running in the same browser tab -- Atomicity and Batches. Cheers!

Related

How do I model group memberships in Cassandra?

What would be an optimal way to implement group memberships in Cassandra? I need the following operations:
Create a new group with an optional list of initial members.
Delete a group.
Add a user to a group.
Delete a user from a group.
List all members of a group.
List all group memberships.
90%+ (maybe even 99%) of all operations on the DB would be listing all members of a group. There's a choice between using sets to maintain the memberships of a group vs. making each member of a group its own row. The load is fairly significant, requiring a large cluster.
Users would be just identified by their ID, which is a short string. Groups would only need to be identified by a string like a UUID. No names or other metadata needed.
The biggest challenge is how to support listing all group memberships efficiently. Any recommendations?
The first principle of data modelling in Cassandra is to design a table for each application query. You need to list the application queries first, THEN design a table for each of those app queries.
DBAs from a relational background typically do the reverse -- they tend to focus on how the data is structured in the table, then later try to design queries for the app. This does not work with Cassandra, because the tables end up not being optimised for the app queries.
In your case, the application query looks something like:
Retrieve the list of user IDs for group ID X.
Rewriting this in something more SQL-like:
SELECT userid FROM table WHERE groupid = X
This app query indicates that the table needs to be partitioned by the group ID (groupid), with each partition containing one or more rows of user IDs (userid). The design (schema) for this table looks something like:
CREATE TABLE groups (
    groupid uuid,
    userid text,
    ...
    PRIMARY KEY (groupid, userid)
);
When you query this table with:
SELECT userid FROM groups WHERE groupid = ?
it will return one or more rows of userid. Cheers!
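The question also asks to list all group memberships for a user. Following the same principle, that is a second app query ("retrieve the list of group IDs for user X"), so it gets its own table partitioned by the user ID. A sketch, with a hypothetical table name:
CREATE TABLE groups_by_user (
    userid text,
    groupid uuid,
    PRIMARY KEY (userid, groupid)
);

-- list all group memberships for a user
SELECT groupid FROM groups_by_user WHERE userid = ?;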

Cassandra - Do I have to do so many writes?

I have 5 Tables:
users_by_id
users_by_username
users_by_email
users_by_likes
users_by_followers
I have to write 5 statements every time a user registers. Is that not expensive or bad?
INSERT INTO users_by_id (...) values (..)
INSERT INTO users_by_email (...) values (..)
INSERT INTO users_by_username (...) values (..)
INSERT INTO users_by_likes (...) values (..)
INSERT INTO users_by_followers (...) values (..)
The second question: if I update users_by_id, I have to write 5 UPDATE statements. Is there another solution? Or is that not that bad?
Cassandra advocates denormalization of your data and creating the data model according to your queries. You will have to write your data model such that it satisfies all the queries with good performance. For performance (due to its architecture and design), Cassandra asks that you write and read using the partition key.
It is not expensive to write 5 insertions of the same set of data into 5 different tables. Your reads will perform better, and as the data size grows to web scale, you will thank your decision to create 5 tables and write to them.
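If you want the five writes to succeed or fail together, they can be grouped in a logged batch. A sketch using the question's table names; the column lists are hypothetical, since the question does not show them:
BEGIN BATCH
  INSERT INTO users_by_id (id, username, email) VALUES (?, ?, ?);
  INSERT INTO users_by_username (username, id, email) VALUES (?, ?, ?);
  INSERT INTO users_by_email (email, id, username) VALUES (?, ?, ?);
  INSERT INTO users_by_likes (likes, id, username) VALUES (?, ?, ?);
  INSERT INTO users_by_followers (followers, id, username) VALUES (?, ?, ?);
APPLY BATCH;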
You can explore materialized views (see the Cassandra and DataStax documentation on materialized views), but remember it is an experimental feature. So you have to understand it properly and also identify the open issues with materialized views.
I would recommend you study the Cassandra data model; that will make things easier to grasp.
Cassandra is designed to be a write-intensive database, so do not hesitate to duplicate your data. One should always design tables for the read queries. If one table satisfies one query, it is a fine design.
As for your second question, you should design your tables in such a way that you do not have to update them. Always think about inserting new values.
For example, take the table design below:
CREATE TABLE user_by_email (
    email text,
    timestamp timestamp,
    name text,
    fullname text,
    userId text,
    PRIMARY KEY (email, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

INSERT INTO user_by_email (email, timestamp, name, fullname, userId)
VALUES (?, toTimestamp(now()), ?, ?, ?);
With this design, you will always get the latest inserted value first. Additionally, this design keeps the change history for that key.
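Reading the latest state back is then a one-line sketch; because the clustering order is newest-first, LIMIT 1 returns the most recent insert:
SELECT * FROM user_by_email WHERE email = ? LIMIT 1;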
Think about how many times we have to update values like user ID, email, or username: rarely.

How to structure the partition key in a Cassandra table for user following

I want to make a table where users follow other users. In this table I need a user ID for the following user and a user ID for the followed user. Some users get followed by hundreds of thousands of other users, so the followers are too numerous to store efficiently in a collection.
How can I model it so that all of the followers of a single user don't end up in the same partition? Before realizing this problem I wanted to use either the following user's ID or the followed user's ID as the partition key, but then there would be a hot-partition problem.
Could I use a composite partition key of following user ID and followed user ID to solve the hot-partition problem? What else could solve this problem?
My 2 main queries are to get all followers of a user and to get all users that a user follows.
If I use a composite partition key with those 2 IDs, can I even run the queries above? Or would I need to query with both IDs, making those queries impossible?
Thank you for your help.
This is a common problem in social graphs where certain personalities have millions more followers than everyone else. I tend to use Taylor Swift and Barack Obama as examples.
A lot of social platforms handle this "super-nodes" issue by isolating them in a separate data store so that the main store doesn't queue up when someone traverses a super-node.
This does mean that your app needs to check the user against a reference table of super-users/super-nodes, so it then does lookups in the sub-table(s) instead of the main table. Cheers!
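As a minimal sketch of that check (the table name and layout here are assumptions, not a prescribed design):
-- hypothetical reference table listing the super-node user IDs
CREATE TABLE super_users (
    user_id int PRIMARY KEY
);

-- if this returns a row, the app routes the lookup to the separate store
SELECT user_id FROM super_users WHERE user_id = ?;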
You should look at a bucketing solution. With bucketing, you introduce an additional key as part of your partition key. For example, you can have your data model like this:
CREATE TABLE user_followers (
    user_id int,
    bucket_id int,
    follower_id int,
    user_name text,
    PRIMARY KEY ((user_id, bucket_id), follower_id)
);
Here user_id and bucket_id make up the partition key, and follower_id is a clustering column so that each bucket can hold many followers. To find all the partitions, you need to know your bucket_ids beforehand.
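Reading a user's followers then means querying each bucket and combining the results in the application. A sketch, assuming the app knows or derives the bucket IDs:
SELECT follower_id FROM user_followers WHERE user_id = ? AND bucket_id = ?;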

How to model data using Cassandra and Ignite together?

I'm researching how to model data when using both Cassandra and Ignite together. So far the basic recommendation for data modeling in Cassandra (coming from this article) is clear: "model data around your queries". The author gives an example of "user lookup": we want to look up users by their username or their email, and according to him the best approach is to have two tables:
CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    email text,
    age int
);

CREATE TABLE users_by_email (
    email text PRIMARY KEY,
    username text,
    age int
);
However, things get confusing with Ignite on top of Cassandra. Unfortunately I could not find any helpful examples or answers to the following questions:
Does having multiple tables that store user information mean having an Ignite cache for each of these tables?
Does having a compound primary key mean introducing a new type for each key and using it as the Ignite cache key?
Having Ignite means not having direct reads from Cassandra. Does it even make sense to bother modeling data following NoSQL best practices? Would it be OK to just have one user table and let Ignite take care of queries by username or email?
CREATE TABLE users (
    id uuid PRIMARY KEY,
    username text,
    email text,
    age int
);
You should probably have one cache per Cassandra table.
If your original key is compound, so should the Ignite key be.
You will need to use secondary indexes in Ignite to query by more than one field, and this means you will have to hold all the data in Ignite (which is NOT necessary for a pure caching scenario). That means enabling readThrough and writeThrough, doing loadCache, and always doing all updates through Ignite. You will have to choose between "Ignite as a cache for Cassandra" (stick to Cassandra's data layout; can hold partial data) and "Ignite as a DB backed by Cassandra" (you can use a layout optimal for Ignite, with secondary indexes).

Updates in Cassandra

Cassandra data modeling embraces the fact that "denormalization and duplication of data is a fact of life with Cassandra". But one of the cons of denormalized data is that it makes updates very hard. For example, if I have three tables catering to different queries, selecting is fine. However, if in my app I want to update a username, I need to update all three tables. The update on the first table looks OK. How about the latter two? Are the updates going to be very expensive? How should I handle this case?
CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    email text,
    age int
);

CREATE TABLE users_by_email (
    email text PRIMARY KEY,
    username text,
    age int
);

CREATE TABLE groups (
    groupname text,
    username text,
    email text,
    age int,
    hash_prefix int,
    PRIMARY KEY ((groupname, hash_prefix), username)
);
This is a typical problem I see when people try to put a relational model that changes over time into Cassandra. Cassandra is a great database, and for what it does, it works wonders. There are many features that enable all kinds of different data models, and you can cover almost all use cases. When you look at your use case, the question is: why would you use Cassandra for a relational model?
If you really want Cassandra to cover your use case, you will have to do a lot of different operations at the application level just to execute updates and keep your data in a consistent state.
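As a sketch of what those application-level operations look like for a username change with the three tables above (all literal values are hypothetical, and the app must already know which groups the user belongs to):
BEGIN BATCH
  -- username is the primary key here: delete the old row, insert a new one
  DELETE FROM users_by_username WHERE username = 'old_name';
  INSERT INTO users_by_username (username, email, age) VALUES ('new_name', 'a@example.com', 30);
  -- username is a regular column here: a plain UPDATE works
  UPDATE users_by_email SET username = 'new_name' WHERE email = 'a@example.com';
  -- username is a clustering column here: again delete + re-insert, once per group
  DELETE FROM groups WHERE groupname = 'g1' AND hash_prefix = 0 AND username = 'old_name';
  INSERT INTO groups (groupname, hash_prefix, username, email, age) VALUES ('g1', 0, 'new_name', 'a@example.com', 30);
APPLY BATCH;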
After watching a few YouTube clips, it looks like Cassandra's update is a simple write that appends a record to the commit log in the file system. The data is then put into a memtable on the Cassandra server, and an acknowledgement is sent to the client straight away, so the update call finishes. This makes updates fast for clients.
The whole compaction process happens afterwards, including flushing, sequential writes, and merging based on timestamps.
