cassandra: can you query against a collection field? - cassandra

Say you wanted to keep a friends list in such a field. Can you run a query along the lines of: where user_id = xxx and friend = 'bob'?
If a collection is not right for this, what is the proper way to keep track of friends in Cassandra?

Secondary indexes on collections are not yet supported, but development is in progress (CASSANDRA-4511).
In your model, if you know the user_id, you can fetch the user and check on the application side whether 'bob' is in their friends list. If you need to query whether person_A is friends with person_B, you can extract the collection into its own table, but this model will require two queries.
CREATE TABLE friends (
    user_id text,
    friend text,
    PRIMARY KEY (user_id, friend)
);
CREATE INDEX idx_friends ON friends (friend);
Say you have a result from the above table like:
 user_id | friend
---------+--------
 daniel  | bob
 daniel  | jack
 jack    | bob
If you want to find all the people following bob, you can use SELECT * FROM friends WHERE friend='bob'. You could also do this without the secondary index by using ALLOW FILTERING, but that can lead to unpredictable performance:
The ALLOW FILTERING option allows to explicitly allow (some) queries that require filtering. Please note that a query using ALLOW FILTERING may thus have unpredictable performance (for the definition above), i.e. even a query that selects a handful of records may exhibit performance that depends on the total amount of data stored in the cluster.
Docs for ALLOW FILTERING.

Related

Advice for storing votes with Prisma&Mongo

I've been using MongoDB for a few years now, and I'm starting to use Prisma's client, which seems to be providing excellent type safety. I'm not familiar at all with relational data storage, though...
This is how my database is looking:
Users are each associated with a single group
Each one of them can be a candidate in the group, and all vote for one of the group candidates during an election once they've been selected
What best practices would you advise me to follow here to store the votes? (I'm not talking about the privacy perspective specific to voting, this is a whole other topic, but only the performance & data storage stance)
Typically with MongoDB, I would create an embedded object in the Group document, with the User IDs of candidates as the keys, and an array of the User IDs of the voters for each candidate as the values.
But strong types & the relational database-inspired approach of Prisma seem to prevent me from doing anything like that, most likely for the better.
In a relational database, you would typically create a separate table to store the votes. Each row in the table would represent a single vote and would contain the following columns:
group_id: the ID of the group in which the vote took place
candidate_id: the ID of the candidate who received the vote
voter_id: the ID of the user who cast the vote
You can also add a timestamp column to store the date and time when the vote was cast.
To get the votes for a particular group, you would perform a query like this:
SELECT * FROM votes WHERE group_id = :groupId;
You can then use the results of this query to build the necessary data structures in your application.
Using a separate table to store votes has the advantage of keeping the data normalized, which can make it easier to work with and more performant. It also allows you to use foreign key constraints to ensure that each vote references a valid group and candidate.
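As a rough illustration of the table described above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whichever relational database sits behind Prisma. The table names (voting_groups, users), the cast_at column, and the "one vote per voter per group" primary key are assumptions made for the example, not something stated in the question.

```python
import sqlite3

# In-memory SQLite stands in for the real relational database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE voting_groups (id INTEGER PRIMARY KEY);
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    group_id INTEGER NOT NULL REFERENCES voting_groups(id)
);
CREATE TABLE votes (
    group_id     INTEGER NOT NULL REFERENCES voting_groups(id),
    candidate_id INTEGER NOT NULL REFERENCES users(id),
    voter_id     INTEGER NOT NULL REFERENCES users(id),
    cast_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (group_id, voter_id)  -- assumed rule: one vote per voter per group
);
""")

# Made-up sample data: one group, three users, three votes.
conn.execute("INSERT INTO voting_groups (id) VALUES (1)")
conn.executemany("INSERT INTO users (id, group_id) VALUES (?, 1)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO votes (group_id, candidate_id, voter_id) VALUES (1, ?, ?)",
    [(1, 2), (1, 3), (2, 1)],
)

def tally(group_id):
    """Return {candidate_id: vote_count} for one group."""
    rows = conn.execute(
        "SELECT candidate_id, COUNT(*) FROM votes WHERE group_id = ? "
        "GROUP BY candidate_id ORDER BY candidate_id",
        (group_id,),
    )
    return dict(rows.fetchall())
```

Letting the database count votes with GROUP BY avoids pulling every row into the application, and the composite primary key rejects a second vote from the same voter.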

Search for more than one element in a list in Cassandra

I'm learning how the data model works in Cassandra, what things you can do and what not, etc.
I've seen that you can have collections, and I'm wondering if you can search for elements inside a collection. I've seen that you can look for one element with CONTAINS, but to look for more than one you need to add more filters. Is there a better way to do this? Is it bad practice?
This is my table definition:
CREATE TABLE data (
    group_id int,
    user timeuuid,
    friends LIST<VARCHAR>,
    PRIMARY KEY (group_id, user)
);
And this is what I know I can use to look for more than one item in the list:
SELECT * FROM data WHERE friends CONTAINS 'bob' AND friends CONTAINS 'Pete' ALLOW FILTERING;
Thank you
Secondary indexes are generally not recommended for performance reasons.
In Cassandra, you should generally follow query-based modelling: design each table around the query it has to serve.
That would mean another table:
CREATE TABLE friend_group_relation (
    friend VARCHAR,
    group_id int,
    <user if needed>
    PRIMARY KEY ((friend), group_id)
);
Now you can use either an IN query (not recommended) or one asynchronous query per friend (strongly recommended, very fast response) on this table.
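The "one query per friend" approach can be sketched as follows. This is only an illustration under stated assumptions: FRIEND_GROUPS and groups_for_friend are hypothetical stand-ins for the friend_group_relation table and a single-partition SELECT, and a thread pool stands in for the driver's asynchronous API (with the DataStax Python driver you would probably reach for something like session.execute_async instead).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the friend_group_relation table: friend -> group ids.
FRIEND_GROUPS = {
    "bob": {1, 2, 3},
    "Pete": {2, 3},
    "alice": {3, 4},
}

def groups_for_friend(friend):
    """Stub for: SELECT group_id FROM friend_group_relation WHERE friend = ?"""
    return set(FRIEND_GROUPS.get(friend, ()))

def groups_with_all_friends(friends):
    """Run one lookup per friend concurrently, then intersect the results."""
    friends = list(friends)
    if not friends:
        return set()
    with ThreadPoolExecutor(max_workers=len(friends)) as pool:
        results = list(pool.map(groups_for_friend, friends))
    return set.intersection(*results)
```

Each lookup touches a single partition, so the queries can run in parallel and the application only has to intersect small result sets.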
You can follow two different approaches:
Pure Cassandra: use a secondary index on your collection type, as described in the documentation.
Solr: you may also be able to create a query against Solr to retrieve your entries. Although this may look more complicated because it requires an extra tool, it avoids using secondary indexes on Cassandra. Secondary indexes on Cassandra are really expensive and, depending on your schema definition, may hurt performance.

How to auto replicate data in cassandra

I am very new to Cassandra and am currently studying it in the early stage of a project.
Cassandra recommends denormalizing data and replicating it, so I have the following scenario:
I have table, user_master, for users. A user has
subject [type text]
hobbies [type list]
uid [type int]
around 40 more attributes
Now, a user wants to search for other users. This search should find all users who match the subject and hobbies provided by the user. For this reason I am planning to make a separate table, user_discovery, which will have only the following attributes for every user:
subject [type text]
hobbies [type list]
uid [type int]
*other irrelevant attributes won't be part of this table.
Now my question is:
Do I need to write to both tables for every insert/update in user_master? Can updating user_discovery be automated whenever there is an insert/update in user_master?
Even after studying a bit, I am still not sure that making a separate table would improve performance, since the number of users would be the same in both tables (although user_discovery would have far fewer columns). Any comment on this would be highly appreciated.
Thanks
The idea of separate tables for queries is to have the key of the table contain what you are looking for.
You don't say what the key of your second table looks like, but your wording "the following attributes for every user" looks like you plan to have the user (Id?) as key. This would indeed have no performance advantage.
If you want to find users by their hobby make a table having the hobby as key, and the user id (or whatever it is you use to look up users) as columns. Write one row per hobby, listing all users having that hobby. Write the user into every row matching one of his hobbies.
Do the same for the subject (i.e. separate table, subject as key, user ids as columns).
Then, if you want to find a user having a list of specific hobbies, make one query per hobby, creating the intersection of the users.
To use this kind of lookup table, you would indeed have to update all tables every time you update a user.
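The lookup-table idea can be sketched in a few lines. The dictionaries below are hypothetical in-memory stand-ins for the Cassandra tables (the names users_by_hobby and users_by_subject are invented for this example); the point is only to show that every user write also touches the lookup tables, and that a search is one lookup per key followed by an intersection.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for three Cassandra tables.
user_master = {}                     # uid -> full user record
users_by_hobby = defaultdict(set)    # hobby -> uids
users_by_subject = defaultdict(set)  # subject -> uids

def save_user(uid, subject, hobbies):
    """Write the user and keep both lookup tables in step."""
    hobbies = list(hobbies)
    old = user_master.get(uid)
    if old is not None:
        # Remove stale lookup entries left by the previous version of the user.
        users_by_subject[old["subject"]].discard(uid)
        for hobby in old["hobbies"]:
            users_by_hobby[hobby].discard(uid)
    user_master[uid] = {"subject": subject, "hobbies": hobbies}
    users_by_subject[subject].add(uid)
    for hobby in hobbies:
        users_by_hobby[hobby].add(uid)

def find_users(subject, hobbies):
    """One lookup per key, then intersect the resulting uid sets."""
    matches = set(users_by_subject.get(subject, ()))
    for hobby in hobbies:
        matches &= users_by_hobby.get(hobby, set())
    return matches
```

Note that an update has to remove the old lookup entries as well as add the new ones; otherwise the lookup tables drift away from user_master.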
Disclaimer: I used this kind of approach rather successfully in a relatively complex setting managing a few hundred thousand users. However, this was two years ago, on a Cassandra 1.5 system. I haven't really looked into the new features of Cassandra 2.0, so I have no idea whether a more elegant approach would be possible today.

What is the correct data model for storing user relationships in Cassandra (i.e. Bob follows John)

I have a system where actions of users need to be sent to other users who subscribe to those updates. There aren't a lot of users/subscribers at the moment, but it could grow rapidly so I want to make sure I get it right. Is it just this simple?
create table subscriptions (
    person_uuid uuid,
    subscribes_person_uuid uuid,
    primary key (person_uuid, subscribes_person_uuid)
)
I need to be able to look up things in both directions, i.e. answer the questions:
Who are Bob's subscribers?
Who does Bob subscribe to?
Any ideas, feedback, suggestions would be useful.
Those two queries represent the start of your model:
you want the user to be the PK or part of the PK.
depending on the cardinality of subscriptions/subscribers you could go with:
for low numbers: using a single table and two sets
for high numbers: using 2 tables similar to the one you describe
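A minimal sketch of the two-table variant, assuming hypothetical table names subscriptions_by_person and subscribers_by_person, modelled here as in-memory dictionaries rather than real Cassandra tables. Each relationship is written twice so that both questions become a lookup by key; in Cassandra the two writes could plausibly be grouped in a logged batch.

```python
from collections import defaultdict
from uuid import uuid4

# Hypothetical stand-ins for two tables, each keyed by the person being queried.
subscriptions_by_person = defaultdict(set)  # person -> people they subscribe to
subscribers_by_person = defaultdict(set)    # person -> people subscribed to them

def subscribe(follower, target):
    """Record one relationship in both tables so each direction is a key lookup."""
    subscriptions_by_person[follower].add(target)
    subscribers_by_person[target].add(follower)

def unsubscribe(follower, target):
    """Remove the relationship from both tables."""
    subscriptions_by_person[follower].discard(target)
    subscribers_by_person[target].discard(follower)

# Made-up users: Bob follows John, and Ann follows Bob.
bob, john, ann = uuid4(), uuid4(), uuid4()
subscribe(bob, john)
subscribe(ann, bob)
```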
@Jacob
Your use case looks very similar to the Twitter example, which I modeled here.
If you want to track both sides of the relationship, you'll need a dedicated table to index them.
Last but not least, depending on whether the users are mutable or not, you can decide to denormalize (e.g. duplicate user content) or just store user ids and then fetch user content from a separate table.
I've implemented a simple join feature in Achilles. Have a look if you want to go this way.

what's the best way to search a social network by prioritizing a users relationships first?

I have a social network set up and via an api I want to search the entries. The database of the social network is mysql. I want the search to return results in the following format: Results that match the query AND are friends of the user performing the search should be prioritized over results that simply match the query.
So can this be done in one query or will I have to do two separate queries and merge the results and remove duplicates?
I could possibly build up a data structure using Lucene and search that index efficiently, but I am wondering if the penalty of updating a document every time a new relationship is created is going to be too much?
Thanks
The reference to Lucene complicates the equation a little bit. Let's solve it (or at least get a baseline) without it first.
Assuming the following data model (or something close to it):
tblUsers
    UserId PK
    UserName
    Age
    ...
tblBuddies
    UserId FK to tblUsers.UserId
    FriendId tblUsers.Userid = Id of one of the friends
    BuddyRating float 0.0 to 1.0 (or whatever normalized scale) indicating
        the level of friendship/similarity/whatever
tblItems
    ItemId PK
    ItemName
    Description
    Price
    ...
tblUsersToItems
    UserId FK to tblUsers.UserId
    ItemId FK to
    ItemRating float 0.0 to 1.0 (or whatever normalized scale) indicating
        the "value" assigned to item by user.
A naive query (but a good basis for an optimized one) could be:
SELECT [TOP 25] I.ItemId, ItemName, Description, SUM(ItemRating * BuddyRating)
FROM tblItems I
LEFT JOIN tblUsersToItems UI ON I.ItemId = UI.ItemId
LEFT JOIN tblBuddies B ON UI.UserId = B.FriendId
WHERE B.UserId = 'IdOfCurrentUser'
  AND SomeSearchCriteria -- Say ItemName = 'MP3 Player'
GROUP BY I.ItemId, ItemName, Description
ORDER BY SUM(ItemRating * BuddyRating) DESC
The idea is that a given item is given more weight if it is recommended/used by a friend. The extra weight is greater if the friend is a close friend [BuddyRating] and/or if the friend recommends this item more strongly [ItemRating].
Optimizing such a query depends on the overall number of items, the average/max number of buddies a given user has, and the average/max number of items a user may have in his/her list.
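To make the scoring concrete, here is a small runnable sketch of the same weighted query using Python's built-in sqlite3 module in place of MySQL. The sample rows and the LIKE search criterion are made up for the example, TOP 25 becomes LIMIT, and plain JOINs are used because the WHERE clause on B.UserId discards unmatched rows anyway.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblBuddies (UserId INTEGER, FriendId INTEGER, BuddyRating REAL);
CREATE TABLE tblItems (ItemId INTEGER PRIMARY KEY, ItemName TEXT);
CREATE TABLE tblUsersToItems (UserId INTEGER, ItemId INTEGER, ItemRating REAL);

-- Made-up sample data: user 1 has two friends, users 2 and 3.
INSERT INTO tblBuddies VALUES (1, 2, 0.5), (1, 3, 1.0);
INSERT INTO tblItems VALUES (10, 'MP3 Player A'), (11, 'MP3 Player B'), (12, 'Radio');
INSERT INTO tblUsersToItems VALUES (2, 10, 1.0), (3, 11, 0.75), (3, 12, 1.0);
""")

def ranked_items(user_id, name_pattern, limit=25):
    """Items matching the pattern, ordered by friend-weighted score."""
    return conn.execute(
        """
        SELECT I.ItemId, I.ItemName, SUM(UI.ItemRating * B.BuddyRating) AS Score
        FROM tblItems I
        JOIN tblUsersToItems UI ON I.ItemId = UI.ItemId
        JOIN tblBuddies B ON UI.UserId = B.FriendId
        WHERE B.UserId = ? AND I.ItemName LIKE ?
        GROUP BY I.ItemId, I.ItemName
        ORDER BY Score DESC
        LIMIT ?
        """,
        (user_id, name_pattern, limit),
    ).fetchall()
```

As in the query above, items that no friend has rated drop out entirely; ranking them below the friend-weighted results would need a second query or an outer join with the user filter moved into the ON clause.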
Is this the type of idea/info you are seeking, or am I missing the question?
One way is to store all your social network graph separately from Lucene. Run your keyword query on Lucene, and also look up all the friends in your network graph. For all the friends that are returned, boost all of those friends' search results by some factor and re-sort. This re-sort would be done outside of Lucene. I've done things like this before and it performs pretty well.
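The boost-and-re-sort step might look roughly like this. The hit format (author_id, doc_id, score) and the boost factor of 2.0 are assumptions chosen for illustration; real Lucene hits would have to be mapped into this shape first.

```python
def boost_and_resort(hits, friend_ids, boost=2.0):
    """Re-rank (author_id, doc_id, score) hits, multiplying friends' scores by boost."""
    friend_ids = set(friend_ids)
    rescored = [
        (author, doc, score * boost if author in friend_ids else score)
        for author, doc, score in hits
    ]
    # Highest adjusted score first; sorted() is stable, so ties keep engine order.
    return sorted(rescored, key=lambda hit: hit[2], reverse=True)

# Made-up keyword-search results, already ordered by the engine's score.
hits = [("u1", "d1", 0.9), ("u2", "d2", 0.6), ("u3", "d3", 0.5)]
reranked = boost_and_resort(hits, friend_ids={"u2"})
```

Because the boost is applied after the keyword search, new friendships never require re-indexing documents; only the friend lookup has to be current.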
You can also create a custom HitCollector that does the boosting as the hits are being collected in Lucene. You'd have to construct a list of the internal Lucene IDs that belong to the friends of the current user.
Your social network graph can be stored in MySQL, in memory as a sparse adjacency matrix, or you can take a look at Neo4j.
