I have trouble understanding one thing from this article - http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
Exercise - We want to get all users by groupname.
Solution:
CREATE TABLE groups (
groupname text,
username text,
email text,
age int,
PRIMARY KEY (groupname, username)
);
SELECT * FROM groups WHERE groupname = 'footballers';
But to find all users in a group we can set PRIMARY KEY (groupname), and it also works.
Why is a clustering key (username) needed in this case? I know that when we set username as the clustering key we can use it in a WHERE clause. But when finding users only by groupname, is there any difference between PRIMARY KEY (groupname) and PRIMARY KEY (groupname, username) in terms of query efficiency?
Clustering keys provide multiple benefits: Query flexibility, result set ordering (within a partition key) and extended uniqueness.
But to find all users in group we can set: PRIMARY KEY (groupname)
Try that once. Create a new table using only groupname as your PRIMARY KEY, and then try to insert multiple usernames for each group. You will find that there is only ever one row per group, and that the username column is overwritten by each new user inserted into that group.
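If you don't have a cluster handy, the overwrite can be illustrated with a toy model. This is only a sketch in Python, not Cassandra itself: the ToyTable class and the sample usernames are made up for illustration, and it assumes nothing more than "an INSERT with an existing primary key overwrites that row".

```python
class ToyTable:
    """Toy stand-in for a Cassandra table: rows are upserted by primary key."""

    def __init__(self, primary_key):
        self.primary_key = tuple(primary_key)
        self.rows = {}  # primary key values -> row (dict of columns)

    def insert(self, **row):
        key = tuple(row[column] for column in self.primary_key)
        # An insert with an existing key overwrites the columns (an upsert).
        self.rows.setdefault(key, {}).update(row)

    def select_by_group(self, groupname):
        return [r for r in self.rows.values() if r["groupname"] == groupname]


# Hypothetical sample data.
users = [("footballers", "alice"), ("footballers", "bob"), ("footballers", "carol")]

partition_only = ToyTable(["groupname"])                # PRIMARY KEY (groupname)
with_clustering = ToyTable(["groupname", "username"])   # PRIMARY KEY (groupname, username)
for groupname, username in users:
    partition_only.insert(groupname=groupname, username=username)
    with_clustering.insert(groupname=groupname, username=username)

partition_only_result = partition_only.select_by_group("footballers")
with_clustering_result = with_clustering.select_by_group("footballers")
print(len(partition_only_result), len(with_clustering_result))
```

With the key on groupname alone, only the last inserted user survives; with username added to the key, all three rows are kept.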
But to find users only by groupname is any difference between PRIMARY KEY (groupname) and PRIMARY KEY (groupname, username) in terms of query efficiency?
If PRIMARY KEY (groupname) performs faster, the most likely reason is that only a single row can be returned.
In this case, defining username as a clustering key provides:
The ability to sort by username within a group.
The ability to query for a specific username within a group.
The ability to add multiple usernames within a group.
You don't need the clustering key if you want to query by groupname.
If you add a clustering key (username in this example), rows will be ordered by username within each groupname.
I have a table with a composite primary key made up of name, description, and ID:
PRIMARY KEY (id, name, description)
Whenever I search Cassandra I need to provide all three keys, but now I have a use case where I want to delete, update, and get based on ID alone.
So I created a materialized view against this table and reordered the keys to put ID first, so that I can search based on ID alone.
But how do I delete or update a record with just an ID?
It's not clear if you are using a partition key with 3 columns, or if you are using a composite primary key.
If you are using a partition key with 3 columns:
CREATE TABLE tbl (
id uuid,
name text,
description text,
...
PRIMARY KEY ((id, name, description))
);
Notice the double parentheses: you need all 3 components to identify your data. So when you query your data by ID from the materialized view, you also need to retrieve the name and description fields, and then issue one delete per tuple <id, name, description>.
Instead, if you use a composite primary key with ID being the only PARTITION KEY:
CREATE TABLE tbl (
id uuid,
name text,
description text,
...
PRIMARY KEY (id, name, description)
);
Notice the single parentheses: in this case you can simply issue one delete, because you already know the partition and don't need anything else.
Check this SO post for a clear explanation on primary key types.
Another thing you should be aware of is that the materialized view will populate a table under the hood for you, and the same rules/ideas about data modeling should also apply for materialized views.
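The difference between the two key layouts above can be sketched with a toy delete function. This is a simplified Python model, not the driver API: delete_where and the sample rows are hypothetical, and the only rule it encodes is that a delete must restrict every partition key column.

```python
def delete_where(rows, partition_key_columns, **where):
    """Toy DELETE: refuses to run unless every partition key column is restricted."""
    missing = [c for c in partition_key_columns if c not in where]
    if missing:
        raise ValueError(f"missing partition key columns: {missing}")
    # Keep every row that differs from the WHERE values in at least one column.
    return [r for r in rows if any(r[c] != v for c, v in where.items())]


# Hypothetical sample rows.
rows = [
    {"id": 1, "name": "a", "description": "x"},
    {"id": 1, "name": "b", "description": "y"},
    {"id": 2, "name": "c", "description": "z"},
]

# PRIMARY KEY (id, name, description): id alone is the partition key,
# so one delete by id removes the whole partition.
after_single = delete_where(rows, ["id"], id=1)

# PRIMARY KEY ((id, name, description)): all three columns form the
# partition key, so id alone is not enough.
try:
    delete_where(rows, ["id", "name", "description"], id=1)
    rejected = False
except ValueError:
    rejected = True

# With the 3-column partition key you delete one tuple at a time.
after_tuple = delete_where(rows, ["id", "name", "description"],
                           id=1, name="a", description="x")
print(after_single, rejected, len(after_tuple))
```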
I have a scenario. Let's say this is my Cassandra table:
CREATE TABLE USER (
id TEXT,
name TEXT,
age int,
role TEXT,
PRIMARY KEY ((id, role), age));
Now I should be able to query the user table using either id or role, or both id and role. My question is: when I use only id or only role in the WHERE clause to find a user, will Cassandra search for the user record in different partitions (and nodes), given that I am not searching with both id and role, which together make up the PK of my table?
When you use a compound partition key like in your example PRIMARY KEY ((id, role), age)
Cassandra will concatenate the two values together. It's a technique used to create a more unique or sometimes granular partition key to better control how evenly data gets distributed around the respective datacenter.
Because id and role are concatenated then hashed, you must always provide both the id AND role. Cassandra will not know what to do if you give only part of the compound partition key.
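To see why a partial key is not enough, here is a rough Python sketch of the idea. It is not how Cassandra is implemented: the node names are invented, MD5 stands in for the real partitioner hash (Murmur3 by default), and the modulo placement is a simplification of the token ring.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def token(*partition_key_values):
    """Hash all partition key components together into one number."""
    joined = "\x00".join(str(v) for v in partition_key_values)
    digest = hashlib.md5(joined.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(*partition_key_values):
    """Simplified placement: the token alone decides which node owns the data."""
    return NODES[token(*partition_key_values) % len(NODES)]


full = token("42", "admin")        # id and role together
again = token("42", "admin")       # same inputs, same token
other_role = token("42", "user")   # same id, different role
id_only = token("42")              # what you could compute from id alone

print(full == again, full == other_role, full == id_only)
print(node_for("42", "admin"))
```

The token computed from id alone is unrelated to the token of (id, role), so there is nothing to tell the coordinator which partition or node to read.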
I have read here that for a table like:
CREATE TABLE user (
username text,
password text,
email text,
company text,
PRIMARY KEY (username)
);
We can create a table like:
CREATE TABLE user_by_company (
company text,
username text,
email text,
PRIMARY KEY (company)
);
in order to support querying by company. But what about primary key uniqueness in the second table?
Modify your table's PRIMARY KEY definition and add username as a clustering key:
CREATE TABLE user_by_company (
company text,
username text,
email text,
PRIMARY KEY (company,username)
);
That will enforce uniqueness, as well as return all usernames for a particular company. Additionally, your result set will be sorted in ascending order by username.
Data will be partitioned by company name across nodes. What if there are a lot of users from one company and fewer from another? Data will be partitioned in an unbalanced way.
That's the balance that you have to figure out on your own. PRIMARY KEY definition in Cassandra is a give-and-take between data distribution and query flexibility. And unless the cardinality of company is very low (like single digits), you shouldn't have to worry about creating hot spots in your cluster.
Also, if one particular company gets too big, you can use a modeling technique known as "bucketing." If I was going to "bucket" your user_by_company table, I would first add a company_bucket column, and use it as an additional (composite) partitioning key:
CREATE TABLE user_by_company (
company text,
company_bucket text,
username text,
email text,
PRIMARY KEY ((company,company_bucket),username)
);
As for what to put into that bucket, it's up to you. Maybe that particular company has East and West locations, so something like this might work:
INSERT INTO user_by_company (company,company_bucket,username,email)
VALUES ('Acme','West','Jayne','jcobb#serenity.com');
The drawback here is that you would then have to provide company_bucket whenever querying that table. But it is a solution that could help you if a company should get too big.
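As a rough illustration of that drawback, here is a toy Python model of the bucketed table. It is only a sketch: the partitions dictionary, the helper functions, the extra users and the example.com addresses are all invented for illustration.

```python
from collections import defaultdict

# (company, company_bucket) -> {username: email}; each key is one partition.
partitions = defaultdict(dict)


def insert_user(company, company_bucket, username, email):
    partitions[(company, company_bucket)][username] = email


def users_for_company(company, buckets):
    """Read one partition per bucket; usernames come back sorted within each."""
    result = []
    for bucket in buckets:
        result.extend(sorted(partitions.get((company, bucket), {})))
    return result


# Hypothetical sample data.
insert_user("Acme", "West", "Jayne", "jayne@example.com")
insert_user("Acme", "West", "Zoe", "zoe@example.com")
insert_user("Acme", "East", "Mal", "mal@example.com")

all_acme = users_for_company("Acme", ["East", "West"])
west_only = users_for_company("Acme", ["West"])
print(all_acme, west_only)
```

Getting every Acme user now takes one read per bucket, and the caller has to know the list of buckets up front.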
I think there is a typo in the blog (the link you mentioned). You are right: with the table structure shown for user_by_company there will be an issue with uniqueness.
To support the typo theory:
In this case, creating a secondary index in the company field in the
user table could be a solution because it has much lower cardinality
than the user's email but let’s solve it with performance in mind.
Secondary indexes are always slower than dedicated table approach.
These are the lines in the blog about querying users by company.
If you were to define company as the primary key OR part of the primary key, there should be no need to create a secondary index.
I'm trying to understand data modeling in Cassandra, coming from a relational background, using this article. However, I fail to understand one of the examples.
In Example 2 User Groups:
CREATE TABLE groups (
groupname text,
username text,
email text,
age int,
PRIMARY KEY (groupname, username)
)
Note that the PRIMARY KEY has two components: groupname, which is the
partitioning key, and username, which is called the clustering key.
This will give us one partition per groupname. Within a particular
partition (group), rows will be ordered by username. Fetching a group
is as simple as doing the following:
SELECT * FROM groups WHERE groupname = ?
However, what I fail to understand is this: if we were to create a group, we'd be passing a single group name and a corresponding user name in the insert.
So how would it be possible to retrieve all the users belonging to a single group using the select statement? Also, since groupname is the primary key, we can't add more users with the same groupname, as it would lead to a violation.
You can think of a partition as a data bucket. It can hold a single row or multiple rows of data. When you read that data bucket, Cassandra can very efficiently access all the rows within the bucket, or just a range of rows you specify by the clustering key.
A partition is the unit of replication within Cassandra, so all the data within one partition bucket is stored on a single node (with possibly extra copies on other nodes if you use a higher replication factor than one).
But the partition key is only part of the key. Each row in the bucket still needs to have a unique primary key, so in that example, each user you stored in a particular group partition would need to have a different user name. So it is the combination of groupname and username that needs to be unique. You can always insert more users under the same groupname as long as each username within the group is different. If you inserted with a duplicate username, then it would be an update to the row with that username instead of adding a row.
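The bucket picture can be sketched in a few lines of Python. This is a toy model under simplifying assumptions, not Cassandra's storage engine: the Partition class, its method names and the sample users are hypothetical.

```python
import bisect


class Partition:
    """Toy model of one group partition: rows kept sorted by username."""

    def __init__(self):
        self._keys = []   # clustering key values (usernames), kept sorted
        self.rows = {}    # username -> columns

    def upsert(self, username, **columns):
        if username not in self.rows:
            bisect.insort(self._keys, username)
            self.rows[username] = {}
        # A duplicate username updates the existing row instead of adding one.
        self.rows[username].update(columns)

    def all_usernames(self):
        return list(self._keys)

    def range_scan(self, start, end):
        """Usernames with start <= username < end, found by binary search."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return self._keys[lo:hi]


footballers = Partition()
for name, age in [("dave", 31), ("alice", 25), ("carol", 28), ("bob", 40)]:
    footballers.upsert(name, age=age)
footballers.upsert("bob", age=41)  # same username: an update, not a new row

print(footballers.all_usernames())
print(footballers.range_scan("b", "d"))
print(footballers.rows["bob"]["age"])
```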
I want to know where and when to use composite columns and composite keys in Cassandra.
Composite keys can be useful in many scenarios.
Imagine a table users in which you have a user id (uuid) as primary key (single-key)
You can't query this table unless you know the id of the user (ignore secondary indexes at the moment).
Now let's consider a table in which you no longer use the id as primary key, but instead use a composite key made of (name, surname, email).
Now you can query your users by knowing
name
name - surname
name - surname - email
The primary key must be unique, and in this scenario the email should guarantee that each user is unique. Take care: to query by email you must know both name and surname, and to query by surname you must know the name (this holds unless you model your data in some other particular way).
CREATE TABLE users (
name text,
surname text,
email text,
age int,
address text,
id uuid,
PRIMARY KEY (name, surname, email)
)
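The "you must know the earlier columns" rule can be written down as a small check. This Python sketch is a deliberate simplification: is_queryable is a made-up helper, it only considers equality restrictions, and it ignores secondary indexes and ALLOW FILTERING.

```python
def is_queryable(primary_key, restricted):
    """True if the restricted columns form a non-empty leading prefix of the key."""
    restricted = set(restricted)
    n = len(restricted)
    return n > 0 and set(primary_key[:n]) == restricted


KEY = ("name", "surname", "email")  # PRIMARY KEY (name, surname, email)

checks = {
    "name": is_queryable(KEY, {"name"}),
    "name+surname": is_queryable(KEY, {"name", "surname"}),
    "name+surname+email": is_queryable(KEY, {"name", "surname", "email"}),
    "email": is_queryable(KEY, {"email"}),
    "name+email": is_queryable(KEY, {"name", "email"}),
    "surname": is_queryable(KEY, {"surname"}),
}
for columns, allowed in checks.items():
    print(columns, allowed)
```

Only the first three combinations pass, matching the list above.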
Another useful scenario can be for data-sorting.
Imagine you have a table in which you keep tweets (identified by a time uuid) made from a tweeter (identified by a uuid).
In Cassandra the first part of the key is known as the partition key, and the remainder is known as the clustering key. Within the same partition key, data is sorted by the clustering key:
CREATE TABLE tweets (
tweeter_id uuid,
tweet_id timeuuid,
content text,
PRIMARY KEY (tweeter_id, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC)
In this scenario, when you ask Cassandra for the tweets made by a tweeter, they come back time-sorted for free.
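A toy Python version of that table shows what "for free" means: the ordering work happens on write, so a read is just a slice. This is only a sketch; plain datetimes stand in for the timeuuid, and the helper names and sample tweets are invented.

```python
from datetime import datetime, timezone

# tweeter_id -> list of (created_at, content), newest first.
tweets_by_tweeter = {}


def add_tweet(tweeter_id, created_at, content):
    rows = tweets_by_tweeter.setdefault(tweeter_id, [])
    rows.append((created_at, content))
    # Keep the partition ordered on write (CLUSTERING ORDER BY ... DESC).
    rows.sort(key=lambda row: row[0], reverse=True)


def latest(tweeter_id, limit):
    """Reading the newest tweets is just taking the head of the partition."""
    return [content for _, content in tweets_by_tweeter.get(tweeter_id, [])[:limit]]


# Hypothetical sample data, inserted out of order.
add_tweet("carlo", datetime(2015, 3, 2, tzinfo=timezone.utc), "second")
add_tweet("carlo", datetime(2015, 3, 1, tzinfo=timezone.utc), "first")
add_tweet("carlo", datetime(2015, 3, 3, tzinfo=timezone.utc), "third")

newest_two = latest("carlo", 2)
print(newest_two)
```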
Cheers,
Carlo