How do I model group memberships in Cassandra?

What would be an optimal way to implement group memberships in Cassandra? I need the following operations:
Create a new group with an optional list of initial members.
Delete a group.
Add a user to a group.
Delete a user from a group.
List all members of a group.
List all group memberships.
90%+ (maybe even 99%) of all operations on the DB would be listing all members of a group. There's a choice between using sets to maintain a group's memberships and making each member of a group a row of its own. The load is significant enough to require a large cluster.
Users would be just identified by their ID, which is a short string. Groups would only need to be identified by a string like a UUID. No names or other metadata needed.
The biggest challenge is how to support listing all group memberships efficiently. Any recommendations?

The first principle of data modelling in Cassandra is to design a table for each application query. List the application queries first, THEN design a table for each of those app queries.
DBAs from a relational background typically do the reverse: they focus on how the data is structured in tables, then later try to design queries for the app. That approach does not work in Cassandra, because the tables end up not being optimised for the app queries.
In your case, the application query looks something like:
Retrieve the list of user IDs for group ID X.
Rewriting this in something more SQL-like:
SELECT userid FROM table WHERE groupid = X
This app query indicates that the table needs to be partitioned by the group ID (groupid) which contains one or more rows of user IDs (userid). The design (schema) for this table looks something like:
CREATE TABLE groups (
groupid uuid,
userid text,
...
PRIMARY KEY (groupid, userid)
)
When you query this table with:
SELECT userid FROM groups WHERE groupid = ?
it will return one or more rows of userid. Cheers!
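The access pattern can be sketched in plain Python. This is an in-memory stand-in for the table, not the driver API: the outer dict key plays the role of the partition key (groupid), and the sorted inner collection plays the role of the clustering column (userid).

```python
from collections import defaultdict

# In-memory stand-in for the groups table:
# partition key = groupid, clustering column = userid.
groups: dict[str, set[str]] = defaultdict(set)

def add_member(groupid: str, userid: str) -> None:
    groups[groupid].add(userid)

def remove_member(groupid: str, userid: str) -> None:
    groups[groupid].discard(userid)

def list_members(groupid: str) -> list[str]:
    # Equivalent of: SELECT userid FROM groups WHERE groupid = ?
    # Rows come back ordered by the clustering column (userid).
    return sorted(groups[groupid])

add_member("g1", "alice")
add_member("g1", "bob")
add_member("g2", "carol")
```

Because all of a group's members live in one partition, the dominant query (list all members of a group) is always a single-partition read.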

Related

cassandra data modeling with denormalization

I read about Cassandra data modeling; everything is clear except that the denormalized data may change. How do I keep it in sync?
How would I update the email when a user's email changes, given this example:
CREATE TABLE groups (
groupname text,
username text,
email text,
age int,
hash_prefix int,
PRIMARY KEY ((groupname, hash_prefix), username)
)
groupname is part of the groups table's key, and the user model may not know which groups a user belongs to, so there is no way to update the email after the user changes it.
Is the solution described below appropriate?
Add a groups column (type set<text>) to the user model.
If the user model has a primary key of username, I can add a DAOperUser(username) with updateName and addGroup methods to the application.
For every username, instantiate its own object (through a factory) that reads its state from the users table on initialization. It then knows both the username and the groups, so a change can be issued as a batched write to both tables (users and groups).
When inserting or updating data, you need to use BATCH statements to keep the data in sync between the two tables users and groups.
For example:
BEGIN BATCH
INSERT INTO users (...) VALUES (...);
INSERT INTO groups (...) VALUES (...);
APPLY BATCH;
If you're interested, there's a free tutorial at https://www.datastax.com/dev -- Atomicity and Batches -- that explains the concepts in detail, with hands-on exercises on a pre-installed Cassandra cluster running in the same browser tab. Cheers!
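A logged batch is all-or-nothing: either every write in it lands or none do. A minimal Python sketch of that guarantee (an in-memory model of the two tables, not the driver's BatchStatement; all names are illustrative):

```python
# In-memory stand-ins for two denormalized tables kept in sync.
users: dict[str, dict] = {}     # keyed by username
groups: dict[tuple, dict] = {}  # keyed by (groupname, username)

def apply_batch(ops):
    """Apply a list of (table, key, row) writes atomically:
    validate/stage everything first, commit only if all writes are valid."""
    staged = []
    for table, key, row in ops:
        if not isinstance(row, dict):
            raise ValueError("row must be a dict")
        staged.append((table, key, row))
    for table, key, row in staged:
        table[key] = row

def update_email(username, new_email, groupnames):
    # One batch touches the users row and every groups row for that user.
    ops = [(users, username, {**users.get(username, {}), "email": new_email})]
    for g in groupnames:
        key = (g, username)
        ops.append((groups, key, {**groups.get(key, {}), "email": new_email}))
    apply_batch(ops)

update_email("alice", "alice@example.com", ["admins", "devs"])
```

The point of the shape is that the application never writes to one table without writing to the other in the same batch.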

Cassandra changing Primary Key vs Firing multiple select queries

I have a table that stores the list of products that a user has. The table looks like this.
CREATE TABLE my_keyspace.userproducts (
userid,
username,
productid,
productname,
producttype,
PRIMARY KEY (userid)
)
All users belong to a group; a group can have anywhere from 1 to 100 users:
userid | groupid | groupname
1      | g1      | grp1
2      | g2      | grp2
3      | g3      | grp3
We have new requirement to display all products for all users in a single group.
So do I change userproducts so that the partition key is now groupid and userid becomes a clustering key, letting me get all results in one single query?
Or do I keep the table design as it is and fire multiple SELECT queries: select all users in a group from the second table, then fire one SELECT per user, consolidate the data in my code, and return it?
Thanks.
Even before getting to your question, your data model as presented has a problem: you say you want to store "a list of products that a user has", but that is not what the table you presented holds - it has a single product for each userid. The userid is the key of your table, and each entry in the table, i.e., each unique userid, has exactly one combination of the other fields.
If you really want each user to have a list of products, you need the primary key to be (userid, productid). This means each record is indexed by both a userid and a productid; in other words, a userid has a list of records, each with its own productid. Cassandra can efficiently fetch all the productid records for a single userid because the first part of the key is the partition key while the second part is a clustering key.
Regarding your actual question, you indeed have two options: either do multiple queries on your original table, or do so-called denormalization, i.e., create a second table holding exactly what you want to be immediately searchable. For the second option you can either maintain it manually (update both tables every time you have new data) or let Cassandra update the second table for you automatically, using a feature called Materialized Views.
Which of the two options - multiple queries or multiple updates - to use really depends on your workload. If it has many updates and rare queries, it is better to keep updates quick and let queries be slower. If, on the other hand, it has few updates but many queries, it is better to make updates slower (each update must touch both tables) and queries faster. Another important issue is how much query latency matters to you: the multiple-queries option not only increases the load on the cluster (which you can solve by throwing more hardware at the problem) but also increases latency, which does not go away with more hardware and for some use cases may be unacceptable.
You can also achieve a similar goal in Cassandra by using the Secondary Index feature, which has its own performance characteristics (in some respects it is similar to the "multiple queries" solution).
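The denormalization option can be sketched in Python: one write fans out to two in-memory "tables", so the group query becomes a single-partition read. The names and shapes below are illustrative, not a driver API:

```python
from collections import defaultdict

# Original table: one partition per user.
user_products = defaultdict(dict)   # userid -> {productid: productname}
# Denormalized table: one partition per group.
group_products = defaultdict(dict)  # groupid -> {(userid, productid): name}

def add_product(groupid, userid, productid, productname):
    # Every write updates both tables (the "multiple updates" option).
    user_products[userid][productid] = productname
    group_products[groupid][(userid, productid)] = productname

def products_for_group(groupid):
    # A single "partition" read instead of one query per user in the group.
    return sorted(group_products[groupid].values())

add_product("g1", 1, 10, "phone")
add_product("g1", 2, 11, "laptop")
```

The write path is doubled, but the group-level query no longer needs to fan out to N per-user queries.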

Audit Trail Design using Table Storage

I'm considering implementing an audit trail for my application using Azure Table Storage.
I need to be able to log all actions for a specific customer and all actions for entities from that customer.
My first guess was to create a table per customer (Audits_CustomerXXX), using the entity id as the partition key and the (DateTime.Max.Ticks - DateTime.Now.Ticks).ToString("D19") value as the row key. This works great when the question is "what happened to a certain entity?" For instance, the audit of a purchase would have PartitionKey = "Purchases/12345" and the reverse timestamp as the RowKey.
But when I want a bird's-eye view of the entire customer, can I just query the table sorted by row key across partitions? Or is it better to create a secondary table holding the same data with different partition keys? Also, when using (DateTime.Max.Ticks - DateTime.Now.Ticks).ToString("D19"), is there a way to prevent errors when two actions in the same partition happen in the same tick (unlikely, but who knows...)?
Thanks
You could certainly create a separate table for the bird's-eye view, but you really don't have to. Since Azure Tables are schema-less, you can keep this data in the same table: use the reverse ticks as the PartitionKey and the entity id as the RowKey. Because you would be querying only on PartitionKey, you could also use a GUID as the RowKey to ensure all entities are unique, or append a GUID to your entity id and use that as the RowKey.
However, do keep in mind that because you're inserting two entities with different PartitionKey values, each entry is a separate request to the Table service, so you will have to safeguard your code against partial failures such as network errors. The way we handle this in our application is to write the payload to a queue message and then process that message in a background process.
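The reverse-tick key from the question can be reproduced outside .NET. The sketch below uses Python and the documented .NET DateTime.MaxValue tick count (3155378975999999999) to show why later events sort first, and appends a GUID to break ties when two actions land in the same tick:

```python
import uuid

MAX_TICKS = 3_155_378_975_999_999_999  # .NET DateTime.MaxValue.Ticks

def reverse_tick_key(ticks: int) -> str:
    # Later timestamps produce lexicographically smaller keys, so the
    # newest audit entries sort first in Table Storage's key order.
    return f"{MAX_TICKS - ticks:019d}"  # "D19": zero-padded to 19 digits

def row_key(ticks: int) -> str:
    # Append a GUID so two events in the same tick never collide.
    return f"{reverse_tick_key(ticks)}_{uuid.uuid4()}"

earlier, later = 1_000, 2_000
assert reverse_tick_key(later) < reverse_tick_key(earlier)
```

Zero-padding to a fixed width is what makes the string order match the numeric order.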

DynamoDB: Store array of IDs and do a batchGet for each ID

In DynamoDB, I have a Groups table and a Users table. One or many Users can belong to a Group.
Using DynamoDB, is it possible to perform one query to get a single Group by ID, and also all of the Users in that Group by the User IDs in that Group record?
If not, what is the most efficient way to do this?
No, you cannot do JOINs in NoSQL databases. What you can do is retrieve the group, read all the user IDs from it, and then fetch the users with either BatchGetItem or a query/scan (if it's the primary index) using the "IN" operator.
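The two-step read (get the group, then batch-get its users) can be sketched with in-memory dicts standing in for the two tables. The real second call would be DynamoDB's BatchGetItem, but the shape of the logic is the same; all table and attribute names are illustrative:

```python
# In-memory stand-ins for the two DynamoDB tables.
groups_table = {"g1": {"groupId": "g1", "userIds": ["u1", "u3"]}}
users_table = {
    "u1": {"userId": "u1", "name": "alice"},
    "u2": {"userId": "u2", "name": "bob"},
    "u3": {"userId": "u3", "name": "carol"},
}

def get_group_with_users(group_id):
    # Step 1: fetch the group item by its key (GetItem).
    group = groups_table[group_id]
    # Step 2: fetch every user item listed on the group (BatchGetItem).
    members = [users_table[uid] for uid in group["userIds"]]
    return group, members
```

Two round trips instead of one join, with the second batched into a single request.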

Storing list in cassandra

I want to store a friends list in Cassandra, where a user may have a few hundred friends. Should I store the list of friends (each an email id) as a list or set in Cassandra, or should I create a separate table with user_id and friend columns that holds all users (millions of them) along with their friends?
If I create a separate table with user_id and friend columns, will there be a performance degradation when retrieving a user's entire friends list, or a single friend, given that the table will contain many rows?
It is important to note that lists and sets in Cassandra are not iterable: when you query them, you get back the whole list or the whole set at once. If the collection has a high cardinality, this can cause problems at query time, such as read timeouts or even a heap OOM error.
Since it sounds like there is no cap on the amount of friends one can have, one option could be to have a separate table that is partitioned on user and clustered on friend.
CREATE TABLE user_friends (
owner_user_id int,
friend_user_id int,
PRIMARY KEY(owner_user_id, friend_user_id)
);
This will ensure that the friend_user_id is in order and will allow you to do client side paging if the number of friends is very large. It also allows for a quick way to check if a person is a friend of a user.
