How does this primary key fetch all the users in the group? - cassandra

I'm trying to understand data modeling in Cassandra, coming from a relational background, using this article. However, I fail to understand one of the examples.
In Example 2 User Groups:
CREATE TABLE groups (
groupname text,
username text,
email text,
age int,
PRIMARY KEY (groupname, username)
)
Note that the PRIMARY KEY has two components: groupname, which is the
partitioning key, and username, which is called the clustering key.
This will give us one partition per groupname. Within a particular
partition (group), rows will be ordered by username. Fetching a group
is as simple as doing the following:
SELECT * FROM groups WHERE groupname = ?
However, what I fail to understand is this: if we were to create a group, we'd be passing a single group name and the corresponding user name in the insert.
So, how would it be possible to retrieve all the users belonging to a single group using the select statement? Also, since the groupname is the primary key, we can't add more users with the same groupname as it would lead to a violation.

You can think of a partition as a data bucket. It can hold a single row or multiple rows of data. When you read that data bucket, Cassandra can very efficiently access all the rows within the bucket, or just a range of rows you specify by the clustering key.
A partition is the unit of replication within Cassandra, so all the data within one partition bucket is stored on a single node (with possibly extra copies on other nodes if you use a higher replication factor than one).
But the partition key is only part of the key. Each row in the bucket still needs to have a unique primary key, so in that example, each user you stored in a particular group partition would need to have a different user name. So it is the combination of groupname and username that needs to be unique. You can always insert more users under the same groupname as long as each username within the group is different. If you inserted with a duplicate username, then it would be an update to the row with that username instead of adding a row.
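To make the bucket picture concrete, here is a minimal Python sketch of the groups table as a toy in-memory model. It is only an illustration and not how Cassandra stores data; the helpers `insert_user` and `select_group` and the sample users are hypothetical.

```python
# Toy model of the groups table: partition key -> {clustering key -> row}.
# Illustration only, not Cassandra's storage engine.
groups = {}

def insert_user(groupname, username, email, age):
    """Insert or overwrite the row identified by (groupname, username)."""
    partition = groups.setdefault(groupname, {})
    partition[username] = {"email": email, "age": age}

def select_group(groupname):
    """Mimic SELECT * FROM groups WHERE groupname = ?, rows ordered by username."""
    partition = groups.get(groupname, {})
    return [(username, partition[username]) for username in sorted(partition)]

insert_user("footballers", "messi", "messi@example.com", 30)
insert_user("footballers", "alves", "alves@example.com", 34)
# Same groupname and username: this updates the existing row instead of adding one.
insert_user("footballers", "messi", "leo@example.com", 31)

rows = select_group("footballers")
print(rows)
```

The group ends up with two rows, ordered by username, and the second insert for "messi" replaces the first.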

Related

Will cassandra do multi partition search for compounded primary key

I have a scenario, let's say below is my cassandra table
CREATE TABLE USER (
id TEXT,
name TEXT,
age int,
role TEXT,
PRIMARY KEY ((id, role), age));
Now I should be able to query the user table using either id or role, or both id and role. My question is: when I use only id or only role in the WHERE clause to find a user, will Cassandra search for the user record across different partitions (and nodes), given that I am not searching by both id and role, which together make up the partition key of my table?
When you use a compound partition key like in your example PRIMARY KEY ((id, role), age)
Cassandra will concatenate the two values together. It's a technique used to create a more unique or sometimes granular partition key to better control how evenly data gets distributed around the respective datacenter.
Because id and role are concatenated and then hashed, you must always provide both the id AND the role. With only part of the compound partition key, Cassandra cannot compute the hash that locates the partition, so it cannot tell which node holds the data.
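The reason both values are needed can be sketched in a few lines of Python. This is a toy stand-in under stated assumptions: Cassandra's default partitioner uses Murmur3 and token ranges, while this sketch uses MD5 from the standard library and a simple modulo over a hypothetical three-node list.

```python
import hashlib

def token(*partition_key_values):
    """Toy partitioner: hash all partition key components together.
    MD5 is a stand-in here; it is not what Cassandra uses by default."""
    joined = "\x00".join(partition_key_values).encode("utf-8")
    return int.from_bytes(hashlib.md5(joined).digest()[:8], "big", signed=True)

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def node_for(*partition_key_values):
    """Pick a node from the token (real clusters use token ranges, not modulo)."""
    return NODES[token(*partition_key_values) % len(NODES)]

full_key = token("42", "admin")   # id and role together identify the partition
partial_key = token("42")         # id alone hashes to an unrelated value
print(full_key, partial_key)
print(node_for("42", "admin"))
```

The token derived from id alone has no relationship to the token of (id, role), so a partial key gives no hint about where the rows live.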

Cassandra - difference in efficiency between simple and compound key

I have a problem with understanding one thing from this article - http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
Exercise - We want to get all users by groupname.
Solution:
CREATE TABLE groups (
groupname text,
username text,
email text,
age int,
PRIMARY KEY (groupname, username)
);
SELECT * FROM groups WHERE groupname = 'footballers';
But to find all users in a group we can set PRIMARY KEY (groupname), and it works as well.
Why is a clustering key (username) needed in this case? I know that when we set username as the clustering key we can use it in a WHERE clause. But when finding users only by groupname, is there any difference between PRIMARY KEY (groupname) and PRIMARY KEY (groupname, username) in terms of query efficiency?
Clustering keys provide multiple benefits: Query flexibility, result set ordering (within a partition key) and extended uniqueness.
But to find all users in group we can set: PRIMARY KEY (groupname)
Try that once. Create a new table using only groupname as your PRIMARY KEY, and then try to insert multiple usernames for each group. You will find that there will only ever be one row per group, and that the username column will be overwritten for each new user within that group.
But to find users only by groupname is any difference between PRIMARY KEY (groupname) and PRIMARY KEY (groupname, username) in terms of query efficiency?
If PRIMARY KEY (groupname) performs faster, the most-likely reason is because there can be only a single row returned.
In this case, defining username as a clustering key provides:
The ability to sort by username within a group.
The ability to query for a specific username within a group.
The ability to add multiple usernames within a group.
You don't need the clustering key if you want to query by groupname.
If you add a clustering key (username in this example), rows will be ordered by username within a groupname.
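The overwrite behaviour that results from using only groupname as the key can be illustrated with a small Python sketch. It models a table as a dict keyed by the tuple of primary key values; the `write` helper and the sample rows are hypothetical.

```python
def write(table, key_columns, row):
    """Store row under the tuple of its primary key column values (an upsert)."""
    key = tuple(row[column] for column in key_columns)
    table[key] = row

users = [
    {"groupname": "footballers", "username": "alves"},
    {"groupname": "footballers", "username": "messi"},
    {"groupname": "footballers", "username": "xavi"},
]

by_group_only = {}      # models PRIMARY KEY (groupname)
by_group_and_user = {}  # models PRIMARY KEY (groupname, username)
for user in users:
    write(by_group_only, ["groupname"], user)
    write(by_group_and_user, ["groupname", "username"], user)

print(len(by_group_only))      # 1: each write replaced the previous row
print(len(by_group_and_user))  # 3: one row per (groupname, username)
```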

Select Cassandra row key

What criteria should be considered when selecting a rowid for a column family in cassandra? I want to migrate a relational database which does not contain any primary key. In that case what should be the best rowid selection?
Use natural keys that can be derived from the dataset if possible (e.g. phone_number for a phone book, user_name for a user table). If that's not possible, use a UUID.
There are many things to consider when choosing the primary key in Cassandra.
Understand the difference between primary and partition key
CREATE TABLE users (
user_name varchar PRIMARY KEY,
password varchar
);
In the above case primary and partition keys are the same.
CREATE TABLE users (
user_name varchar,
user_email varchar,
password varchar,
PRIMARY KEY (user_name, user_email)
);
Here the primary key is user_name and user_email together, whereas user_name alone is the partition key.
CREATE TABLE users (
user_name varchar,
user_email varchar,
password varchar,
PRIMARY KEY ((user_name, user_email))
);
Here the primary key and the partition key are both equal to (user_name, user_email).
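The three PRIMARY KEY forms above follow one rule: the first element is the partition key (possibly a parenthesized group of columns), and anything after it is a clustering column. A minimal Python sketch of that rule, using a hypothetical helper `split_primary_key` that takes the clause written as a tuple:

```python
def split_primary_key(primary_key):
    """Return (partition_key_columns, clustering_columns).

    primary_key mirrors the CQL clause as a tuple: the first element is the
    partition key (a name, or a tuple of names for a composite partition key);
    any remaining elements are clustering columns.
    """
    first, *rest = primary_key
    partition = list(first) if isinstance(first, tuple) else [first]
    return partition, rest

print(split_primary_key(("user_name",)))
print(split_primary_key(("user_name", "user_email")))
print(split_primary_key((("user_name", "user_email"),)))
```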
Carefully define your partition key. Partition keys are used for lookups by cassandra, so you must define your partition key by looking at your select queries.
Cassandra organizes data so that lookups go through the partition key. Using the previous examples:
For the first case:
user_name ---> email:password email:date_of_birth
ABC --> abc#gmail.com:abc123 abc#gmail.com:22/02/1950 abc#yahoo.com:def123...
In the second case:
user_name,email ---> password date_of_birth
ABC,abc#gmail.com --> abc123 22/02/1950
Making the partition key more complex, so that it contains more columns, ensures that you have many rows instead of a single row with many columns. It can be beneficial to balance the number of rows you create against the number of columns each row might have. Having incredibly large or incredibly small rows might not be good for reads.
Partition keys indicate how data is distributed across nodes, so consider whether you have hotspots and decide whether you want to break it further.
Case 1:
All users named ABC will be in a single node
Case 2:
Users named ABC might or might not be in the single node, depending on the key that is generated along with their email.
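The difference between Case 1 and Case 2 can be sketched in Python by hashing the partition key of a few rows. This is a toy model: MD5 stands in for the real partitioner, and the e-mail addresses are made up for illustration.

```python
import hashlib

def token(*values):
    """Toy partitioner: hash the partition key components (MD5 as a stand-in)."""
    data = "\x00".join(values).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

rows = [
    ("ABC", "abc@example.com"),
    ("ABC", "abc@example.org"),
    ("ABC", "abc@example.net"),
]

# Case 1: PRIMARY KEY (user_name, user_email), so the partition key is user_name only.
case1_tokens = {token(name) for name, _ in rows}
# Case 2: PRIMARY KEY ((user_name, user_email)), so both columns form the partition key.
case2_tokens = {token(name, email) for name, email in rows}

print(len(case1_tokens))  # 1: every ABC row lands in the same partition
print(len(case2_tokens))  # 3: each (name, email) pair gets its own partition
```

With one shared token, all ABC rows sit together and can become a hotspot; with separate tokens they can spread across nodes, but can no longer be fetched by user_name alone.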
Your partition key(s) should be how you want to store the data and how you will always look it up. You can only retrieve data by partition key, so it's important to choose something that you will naturally look up (this is why sometimes data is denormalized in Cassandra by storing it in multiple tables that mimic materialized views).
The clustering column key(s), if any, are mostly useful if you sometimes want to retrieve all the data in a partition and sometimes only want some of it. This is great for things like timeseries data because you can cluster the data on a timeuuid, store it sorted, and then do efficient range queries over the data.
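Why sorted clustering columns make range queries cheap can be shown with a short Python sketch. It models one partition as a sorted list and uses binary search to slice out a range. Integer timestamps stand in for timeuuid values, the timestamps are assumed to be distinct, and the `insert` and `range_query` helpers are hypothetical.

```python
from bisect import bisect_left, bisect_right

# One partition, modeled as a list of (timestamp, value) kept sorted by timestamp,
# similar to how clustering columns keep rows ordered inside a partition.
partition = []

def insert(timestamp, value):
    """Insert a reading while preserving clustering order."""
    index = bisect_left(partition, (timestamp,))
    partition.insert(index, (timestamp, value))

def range_query(start, end):
    """Return readings with start <= timestamp <= end using binary search."""
    low = bisect_left(partition, (start,))
    high = bisect_right(partition, (end, float("inf")))
    return partition[low:high]

for ts, reading in [(30, 2.5), (10, 1.0), (20, 1.7), (40, 3.1)]:
    insert(ts, reading)

print(range_query(15, 30))  # [(20, 1.7), (30, 2.5)]
```

Because the rows are already in order, a range read is a contiguous slice rather than a scan of the whole partition.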

Cassandra: best usage of partition and row key

I have the following data structure:
{
ClientId: string,
ItemId: string,
Item : string
}
I want to store this data in a Cassandra cluster. I know that some clients have many more items than others, yet I want to store data evenly on every node of my cluster, since I have only a single query, by ClientId and ItemId together.
As far as I understand, I need to specify the partition key so that data is distributed evenly, so in CQL it would look like:
CREATE TABLE IF NOT EXISTS mykeyspace.mytable
(
ClientId text,
ItemId text,
Item text,
PRIMARY KEY((ClientId, ItemId))
);
Do I need to specify anything as a row key? ClientId+ItemId uniquely identifies any row, so should I put anything after the first closing ")"?
One way is to make a hash of your partition key columns and then use that hash as the partition key.
You could also add the time of the last purchase in there: ((ClientId, ItemId, lastPurchaseTime))
Do I need to specify anything as a row key? ClientId+ItemId uniquely identifies any row, so should I put anything after the first closing ")"?
Your example schema will do exactly what you want and work well. There's no need to add anything else to the primary key.
(If you added more columns to the primary key, they would serve as clustering columns, which control the ordering of the rows on disk for a single partition.)

How to make Cassandra have a varying column key for a specific row key?

I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in Cassandra for a given row key. Is that true? And if it's true, how do you allow for varying row keys?
The reason I think this might be true is because say we have a user and it can like many items and we simply want the userId to be the rowkey. We let this rowKey (userID) map to all the items that specific user might like. Each specific user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID each user likes, then we could solve the problem that way.
Therefore, is it possible to have varying length of cassandra column keys for a specific rowKey? (and how do you do it)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files that define keyspaces beforehand, and it seems pretty inflexible when it comes to making things dynamic, i.e. allowing additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
How can this even allow growing columns? Don't we need to specify the name beforehand anyway? Or additional custom columns as the application desires?
Yes, you can have a varying number of columns per row_key. From a relational perspective, it's not obvious that tid behaves like a variable column key rather than a fixed column: each distinct tid value stored under the same test value becomes a new entry under that row_key.
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
So in your example, you need to identify the row_key, column_key, and payload of the table.
The primary key contains both the row_key and column_key.
test is your row_key.
tid is your column_key.
result is your payload.
The following inserts are all valid:
INSERT INTO your_keyspace.results (test, tid, result) VALUES (textAsBlob('row_key_1'), a4a70900-24e1-11df-8924-001ff3591711, 'blob_1');
INSERT INTO your_keyspace.results (test, tid, result) VALUES (textAsBlob('row_key_1'), a4a70900-24e1-11df-8924-001ff3591712, 'blob_2');
-- notice that the column_key changed but the row_key remained the same
INSERT INTO your_keyspace.results (test, tid, result) VALUES (textAsBlob('row_key_2'), a4a70900-24e1-11df-8924-001ff3591711, 'blob_3');
Have you thought of exploring the collection support in Cassandra for handling such relations in a colocated way (e.g. on the same data node)?
Not sure if it helps, but what about keeping the user id as the row key and a map with the item id as key and some value?
-Vivel
