What is the correct data model for storing user relationships in Cassandra (i.e. Bob follows John) - cassandra

I have a system where actions of users need to be sent to other users who subscribe to those updates. There aren't a lot of users/subscribers at the moment, but it could grow rapidly so I want to make sure I get it right. Is it just this simple?
CREATE TABLE subscriptions (
    person_uuid uuid,
    subscribes_person_uuid uuid,
    PRIMARY KEY (person_uuid, subscribes_person_uuid)
);
I need to be able to look things up in both directions, i.e. answer the questions:
Who are Bob's subscribers?
Who does Bob subscribe to?
Any ideas, feedback, suggestions would be useful.

Those two queries are the starting point for your model:
You want the user to be the primary key, or part of it, in whatever table answers each query.
Depending on the cardinality of subscriptions/subscribers, you could go with:
for low numbers: a single table with two set columns (one for subscribers, one for subscriptions)
for high numbers: two tables similar to the one you describe, one per direction
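A minimal sketch of the two-table option, using in-memory Python dictionaries to stand in for the two Cassandra tables. The names `following`, `followers`, `subscribe` and `unsubscribe` are illustrative assumptions, not part of any driver API; in a real cluster each function would issue one write per table, each table being keyed by the user you query on.

```python
from collections import defaultdict

# Each dict stands in for one table; the dict key plays the role of the partition key.
following = defaultdict(set)  # person -> people that person subscribes to
followers = defaultdict(set)  # person -> people subscribed to that person


def subscribe(subscriber, target):
    """Record the relationship in both tables so each direction is a single-key lookup."""
    following[subscriber].add(target)
    followers[target].add(subscriber)


def unsubscribe(subscriber, target):
    """Remove the relationship from both tables."""
    following[subscriber].discard(target)
    followers[target].discard(subscriber)


subscribe("bob", "john")
subscribe("alice", "bob")

print(sorted(following["bob"]))  # who Bob subscribes to
print(sorted(followers["bob"]))  # who subscribes to Bob
```

The cost of this layout is that every subscribe or unsubscribe touches two tables, which is the usual trade-off for making both lookups a direct read by key.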

@Jacob
Your use case looks very similar to the Twitter example; I modeled it here.
If you want to track both sides of the relationship, you'll need a dedicated table to index each of them.
Last but not least, depending on whether the users are mutable or not, you can decide either to denormalize (e.g. duplicate the user content) or to store only user ids and fetch the user content from a separate table.
I've implemented a simple join feature in Achilles. Have a look if you want to go this way.

Related

Advice for storing votes with Prisma&Mongo

I've been using MongoDB for a few years now, and I'm starting to use Prisma's client, which seems to be providing excellent type safety. I'm not familiar at all with relational data storage, though...
This is how my database is looking:
Users are each associated with a single group
Each one of them can be a candidate in the group, and all vote for one of the group candidates during an election once they've been selected
What best practices would you advise me to follow here to store the votes? (I'm not talking about the privacy perspective specific to voting, this is a whole other topic, but only the performance & data storage stance)
Typically with MongoDB, I would create an embedded object in the Group document, with the User IDs of candidates as the keys, and an array of the User IDs of the voters for each candidate as the values.
But strong types & the relational database-inspired approach of Prisma seem to prevent me from doing anything like that, most likely for the better.
In a relational database, you would typically create a separate table to store the votes. Each row in the table would represent a single vote and would contain the following columns:
group_id: the ID of the group in which the vote took place
candidate_id: the ID of the candidate who received the vote
voter_id: the ID of the user who cast the vote
You can also add a timestamp column to store the date and time when the vote was cast.
To get the votes for a particular group, you would perform a query like this:
SELECT * FROM votes WHERE group_id = :groupId;
You can then use the results of this query to build the necessary data structures in your application.
Using a separate table to store votes has the advantage of keeping the data normalized, which can make it easier to work with and more performant. It also allows you to use foreign key constraints to ensure that each vote references a valid group and candidate.
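As a rough sketch of the votes table described above, here is a version using Python's built-in `sqlite3` module as a stand-in for a relational database (the question itself uses Prisma with MongoDB, so this only illustrates the table shape). The column names follow the answer; the `cast_at` column, the helper names `cast_vote` and `tally`, and the choice of `(group_id, voter_id)` as primary key (one vote per voter per group) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE votes (
        group_id     INTEGER NOT NULL,
        candidate_id INTEGER NOT NULL,
        voter_id     INTEGER NOT NULL,
        cast_at      TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (group_id, voter_id)  -- assumption: one vote per voter per group
    )
""")


def cast_vote(group_id, candidate_id, voter_id):
    """Insert a vote; return False if this voter has already voted in the group."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO votes (group_id, candidate_id, voter_id) VALUES (?, ?, ?)",
                (group_id, candidate_id, voter_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def tally(group_id):
    """Return {candidate_id: vote_count} for one group."""
    rows = conn.execute(
        "SELECT candidate_id, COUNT(*) FROM votes WHERE group_id = ? "
        "GROUP BY candidate_id ORDER BY candidate_id",
        (group_id,),
    )
    return dict(rows.fetchall())


cast_vote(1, 10, 100)
cast_vote(1, 10, 101)
cast_vote(1, 11, 102)
cast_vote(1, 11, 100)  # rejected: voter 100 has already voted in group 1
print(tally(1))
```

Letting the database count with `GROUP BY` avoids pulling every vote row into the application just to build the totals.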

How to denormalize deep hierarchies?

I’ve read quite a lot about Cassandra and the art of denormalization and materialization while writing the data. I think I understand the concept, and it seems to make sense. However, I am having some trouble implementing it in scenarios where there is a deep hierarchical data structure.
Consider the contrived domain where
Owner 1:* Company
Company 1:* Teams
Team 1:* Players
Players 1:* Equipment
We have tables for each of these entities, but we would also like to query quickly for equipment attributes by owner so it seems the thing to do is create a table (OwnerEquipment) that has the owner id and the equipment id as the primary key with the owner id as the partition key. This makes sense, but what if the UX scenarios that add and edit equipment do not include the owner’s id as part of the working set?
Most of the denormalization examples I’ve encountered in my research are usually a single level parent-child or master-detail type use case. It seems pretty reasonable that an updating client would have enough information about the immediate parent when updating the child to write the denormalized reverse index, but what if the data you would really like to denormalize by is several “joins” away?
This problem is compounded further in our example when we consider a Company is sold to a different Owner. Assume that the desired behavior is for OwnerEquipment to reflect this change. How should the code that writes this updated Company to the database handle the OwnerEquipment table updates? Should it, knowing the ID of the old owner, try to update all the OwnerEquipment records for that owner? This seems like a very un-Cassandra-y thing to do and also fraught with concurrency issues. The problem gets worse as you move down the chain (Team to new Company, Player to new Team). In these cases the “old owner” is not necessarily in the working set and would need to be read in order to be updated.
Are there some better ways to think about this problem?
"This makes sense, but what if the UX scenarios that add and edit equipment do not include the owner’s id as part of the working set?"
Pass the owner id along with the equipment id to the UX. The owner id can be a hidden value that is not shown in the interface.
"but what if the data you would really like to denormalize by is several “joins” away?"
Create one table per query use case, as many as you need.
For multiple updates and denormalizations, you can look at the new materialized views feature. Read my blog: www.doanduyhai.com/blog/?p=1930

How to auto replicate data in cassandra

I am very new to Cassandra and am at an early stage of a project in which I am studying it.
Cassandra encourages denormalizing and duplicating data, so I have the following scenario:
I have a table, user_master, for users. A user has
subject [type text]
hobbies [type list]
uid [type int]
around 40 more attributes
Now, a user wants to search for other users. The search should find all users who match the subject and hobbies provided. For this reason I am planning to create a separate table, user_discovery, which will hold only the following attributes for every user:
subject [type text]
hobbies [type list]
uid [type int]
*other irrelevant attributes won't be part of this table.
Now my question is:
Do I need to write to both tables on every insert/update in user_master? Can updating user_discovery be automated whenever there is an insert/update in user_master?
Even after studying a bit, I am still not sure that a separate table would improve performance, since the number of users would be the same in both tables (although user_discovery would have far fewer columns). Any comments on this would be highly appreciated.
Thanks
The idea of separate tables for queries is to have the key of the table contain what you are looking for.
You don't say what the key of your second table looks like, but your wording "the following attributes for every user" suggests you plan to have the user (id?) as key. This would indeed have no performance advantage.
If you want to find users by their hobby, make a table with the hobby as key and the user ids (or whatever it is you use to look up users) as columns. Write one row per hobby, listing all users who have that hobby; each user is written into every row matching one of their hobbies.
Do the same for the subject (i.e. separate table, subject as key, user ids as columns).
Then, if you want to find users having a list of specific hobbies, make one query per hobby and intersect the resulting sets of users.
To use this kind of lookup table, you would indeed have to update all the tables every time you update a user.
Disclaimer: I used this kind of approach rather successfully in a relatively complex setting managing a few hundred thousand users. However, this was two years ago, on a Cassandra 1.5 system. I haven't really looked into the new features of Cassandra 2.0, so I have no idea whether it would be possible to use a more elegant approach today.
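The lookup-table approach can be sketched in a few lines of Python, with dictionaries standing in for the hobby and subject tables. This is only an illustration under assumed names (`users_by_hobby`, `users_by_subject`, `index_user`, `find_users`); against Cassandra, each dictionary access would be one query by key and the intersection would still happen on the client.

```python
from collections import defaultdict

users_by_hobby = defaultdict(set)    # hobby -> user ids (one "row" per hobby)
users_by_subject = defaultdict(set)  # subject -> user ids (one "row" per subject)


def index_user(uid, subject, hobbies):
    """Write the user into the subject row and into every matching hobby row."""
    users_by_subject[subject].add(uid)
    for hobby in hobbies:
        users_by_hobby[hobby].add(uid)


def find_users(subject, hobbies):
    """One lookup per key, then intersect the resulting id sets on the client."""
    result = set(users_by_subject.get(subject, set()))
    for hobby in hobbies:
        result &= users_by_hobby.get(hobby, set())
    return result


index_user(1, "math", ["chess", "hiking"])
index_user(2, "math", ["chess"])
index_user(3, "physics", ["chess", "hiking"])

print(find_users("math", ["chess"]))
print(find_users("math", ["chess", "hiking"]))
```

The sketch only covers inserts; changing a user's hobbies or subject would also require removing the user id from the rows that no longer apply.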

How to perform intersection operation on two datasets in Key-Value store?

Let's say I have 2 datasets, one for rules, and the other for values.
I need to filter the values based on rules.
I am using a key-value store (Couchbase, Cassandra, etc.). I can use multi-get to retrieve all the values from one table and all the rules from the other, and perform the validation in a loop.
However, I find this very inefficient: I move a massive volume of data (values) over the network, and the client is kept busy filtering.
What is the common pattern for finding the intersection between two tables in a key-value store?
The idea behind the NoSQL data model is to write data in a denormalized way, so that each table can answer one precise query. As an example, imagine you have reviews written by customers about shops. You need to know both the reviews written by a given user and the reviews received by a given shop. This would be modeled using two tables:
ShopReviews
UserReviews
In the first table you query by shop id, in the second by user id; the data is written twice, but each read is a direct access by key.
In the same way, you should organize values by rules (I can't be more precise without knowing the relation between them). One more consideration: newer versions of NoSQL databases support collections, which might help to model one-to-many relations.
HTH, Carlo

Understanding Kohana ORM Relationships

I know this question has been asked a million times, but I can't seem to find one that really gives me a good understanding of how relationships work in Kohana's ORM Module.
I have a database with 5 tables:
approved_submissions
  -submission_id
  -contents
favorites
  -user_id
  -submission_id
ratings
  -user_id
  -submission_id
  -rating
users
  -user_id
votes
  -user_id
  -submission_id
  -vote
Right now, favorites, ratings, and votes each have a primary key that consists of every column in the table, so as to prevent a user favoriting the same submission_id multiple times, a user voting on the same submission_id multiple times, etc. I also believe these fields are set up using foreign keys that reference approved_submissions and users, so as to prevent invalid data existing in the respective fields.
Using the DB module, I can access and update these tables no problem. I really feel as though ORM may offer a more powerful and accessible way to accomplish the same things using less code.
Can you demonstrate how I might update a user voting on a submission_id? A user removing a favorite submission_id? A user changing their rating on a particular submission_id?
Also, do I need to make changes to my database structure or is it okay the way it is?
You're probably looking for has_many_through relationships.
So to add a new submission, you'd do something like
$user->add('submissions', $submission);
and to remove
$user->remove('submissions', $submission);
You may want to consider restructuring your database table and key names so you don't end up doing a lot of configuration.
