Understanding the relationship between primary key and partitioning in Cassandra

I am new to Cassandra and have a few novice-level questions about the primary key.
Is the primary key supposed to be unique per record? (My guess would be not.)
To elaborate, suppose my table looks like this:
CREATE TABLE user_action (
    user_id int,
    action text,
    date_of_action date,
    PRIMARY KEY (user_id)
);
I am guessing I can have multiple rows with the same user_id.
If the primary key is not unique per record, can a primary key be split across many partitions?
Can a partition have multiple primary keys?
Is the primary key itself used to pick the partition, or is a hash code of the primary key used to pick it?
Is it fair to think of a partition as a file?

The primary key and the partition key are the same in some cases, but not always; it depends on how many columns the primary key has. Data is distributed across the cluster based on the partition key, and each partition key value identifies exactly one partition. I am not explaining every scenario and concept here, but you should go through the documentation; I am sure you will understand things quickly after reading the links below.
https://www.datastax.com/blog/2016/02/most-important-thing-know-cassandra-data-modeling-primary-key
https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useCompoundPrimaryKeyConcept.html

1> Is the Primary key supposed to be unique per record? (My guess would be not.) To elaborate. Suppose my table looks like this
CREATE TABLE user_action ( user_id int, action text, date_of_action date, PRIMARY KEY (user_id) )
The primary key is supposed to be unique per record/row. In the example you mentioned, you can have only one record per user_id. To allow multiple rows with the same user_id, you have to introduce a differentiating column. This column is called a clustering key in Cassandra, and it forms part of the primary key.
The primary key is the combination of the partition key and the clustering key(s). Cassandra uses the partition key to find a partition. If a clustering key is defined in the data model, it is used to tell rows within that partition apart. If no clustering key is defined, as in your case, only one record is kept per partition key value; a second insert with the same user_id overwrites the first.
In the example below you can have several records with the same user_id, one per state. Here the primary key is the combination (user_id, state): user_id is the partition key and state is the clustering key.
CREATE TABLE user_action (
    user_id int,
    state text,
    action text,
    date_of_action date,
    PRIMARY KEY (user_id, state)
);
I am guessing I can have multiple rows with the same user_id
As explained above, you can have multiple rows with the same user_id only if you define a clustering key. With the table definition you quoted, it is not possible.
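To make the uniqueness rule concrete, here is a toy Python model. It is not Cassandra code, and the `ToyTable` class and its methods are made up for illustration. It treats a table as a map from the full primary key to a row, which roughly mirrors how an INSERT with an existing primary key behaves as an overwrite.

```python
class ToyTable:
    """Toy model of a table: at most one row per distinct primary key value."""

    def __init__(self, primary_key_columns):
        # Hypothetical helper: column names that make up the primary key.
        self.primary_key_columns = tuple(primary_key_columns)
        self.rows = {}

    def insert(self, **row):
        key = tuple(row[c] for c in self.primary_key_columns)
        # Same primary key: the new column values replace the old ones.
        self.rows.setdefault(key, {}).update(row)

    def count(self):
        return len(self.rows)


# PRIMARY KEY (user_id): the second insert overwrites the first.
only_partition_key = ToyTable(["user_id"])
only_partition_key.insert(user_id=1, action="login")
only_partition_key.insert(user_id=1, action="logout")

# PRIMARY KEY (user_id, state): both rows survive.
with_clustering_key = ToyTable(["user_id", "state"])
with_clustering_key.insert(user_id=1, state="RJ", action="login")
with_clustering_key.insert(user_id=1, state="GJ", action="logout")

print(only_partition_key.count(), with_clustering_key.count())
```

In this sketch the first table ends up with one row and the second with two, which is the behaviour the answer describes.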
2> If primary key is not one per record, can a primary key be split across many partitions?
A primary key cannot be split across many partitions. As explained above, the partition key part of the primary key always points to exactly one partition.
3> Can a partition have multiple primary keys?
In the example I quoted, (1,RJ) and (1,GJ) are possible primary keys that both point to the single partition identified by partition key value 1. So a partition can hold multiple primary keys in that sense.
4> Is the primary key itself decided to pick the partition or is the hashCode of the primary key used to pick a partition?
A hash of the partition key (which is part of the primary key) is used to find the partition. The clustering key does not take part in that hash.
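As a rough illustration of that point, the sketch below hashes only the partition key part of each primary key. Everything here is a stand-in: `toy_token` uses MD5 purely because it ships with Python (it is not the hash Cassandra's default partitioner uses), and `toy_node_for` uses a simple modulo instead of real token ranges.

```python
import hashlib


def toy_token(partition_key_values):
    """Stand-in hash: maps partition key values to a signed 64-bit integer."""
    data = repr(tuple(partition_key_values)).encode("utf-8")
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def toy_node_for(token, node_count):
    """Hypothetical placement rule: modulo instead of token ranges on a ring."""
    return token % node_count


# Primary keys are (user_id, state); only user_id is the partition key.
primary_keys = [(1, "RJ"), (1, "GJ"), (2, "RJ")]
tokens = {pk: toy_token(pk[:1]) for pk in primary_keys}

for pk, token in tokens.items():
    print(pk, "-> node", toy_node_for(token, node_count=3))
```

Because the clustering key is left out of the hash, (1, "RJ") and (1, "GJ") get the same token and therefore land in the same partition.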
5>Is it fair to think of a partition as a file?
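Not exactly; it will depend on your data model. On disk, Cassandra keeps data in SSTable files: one SSTable usually holds many partitions, and the data of a single partition can be spread over several SSTables.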

Related

Order of column in composite partitioning key

I am using the Scylla database and I have created a composite partition key made of two columns.
Does the order of the columns matter in this case?
Table definition
create table X(
    user_id text,
    city text,
    name text,
    PRIMARY KEY ((user_id, city))
);
Will anything change if I write
PRIMARY KEY ((city, user_id))?
In a composite partition key the order does not matter.
Switching the order of the keys may result in different hash values. But it shouldn't reduce the efficiency of data distribution.
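A small experiment can make this plausible. The sketch below is a toy model, not Scylla's actual hashing: `toy_hash` uses MD5 as a stand-in, the four buckets play the role of nodes, and the sample rows are invented. It hashes the same data once as (user_id, city) and once as (city, user_id).

```python
import hashlib
from collections import Counter


def toy_hash(values):
    """Stand-in hash over the partition key columns, in the given order."""
    data = "\x00".join(values).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def bucket_counts(keys, buckets=4):
    """Count how many partition keys fall into each toy bucket."""
    return Counter(toy_hash(k) % buckets for k in keys)


# Invented sample data: 1000 users spread over 10 cities.
rows = [(f"user{i}", f"city{i % 10}") for i in range(1000)]

as_declared = bucket_counts((user, city) for user, city in rows)
swapped = bucket_counts((city, user) for user, city in rows)

print("(user_id, city):", sorted(as_declared.items()))
print("(city, user_id):", sorted(swapped.items()))
```

An individual key generally hashes to a different value after the swap, yet in both runs the keys spread over all buckets in a similar way, which matches the answer above.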

Cassandra DB misunderstanding partition key and primary key

Good Evening,
My current understanding of partition and primary keys is that the partition key distributes the data between the nodes, and that the primary key ALWAYS contains the partition key. I want a partition key that groups rows sharing the same partition key value, and within each group I want a primary key that makes the rows unique. As I first understood Cassandra, this would be possible if I could separate the partition key from the primary key. Is this possible?
An example to illustrate my idea:
country  state  unique_id
USA      TEXAS  123
USA      TEXAS  114
country and state as the partition key and the unique id as the primary key.
If I create the primary key like this: PRIMARY KEY ((country, state,unique_id)) I can't filter without using the unique_id but I want e.g. a query like SELECT unique_id FROM table WHERE state = 'Texas' and country = 'USA'.
If I create the primary key this way: PRIMARY KEY ((country, state)), it obviously overwrites the data every time an entry is inserted with the same country and state. That's why I need the unique primary key.
The primary key always includes the partition key, which is always the first item in the primary key. The partition key can consist of multiple columns; that's why you have parentheses around the first item in your example. I believe that in your case the primary key should be as follows:
PRIMARY KEY ((country, state), unique_id)
In this case, the partition key is the combination of country + state, and inside that partition you will have unique IDs that are used to select specific items. The general syntax for a primary key is:
partition key, clustering column1, clustering column2, ...
where partition key could be either:
column - single column
(column1, column2, ...) - multiple columns
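The syntax rule above can be sketched as a small Python function. `split_primary_key` is a hypothetical helper written for this illustration; it is not a CQL parser and only handles the simple shapes discussed here (plain column names, optionally with one parenthesised group at the front).

```python
def split_primary_key(inner):
    """Split the text inside PRIMARY KEY (...) into
    (partition key columns, clustering columns)."""
    inner = inner.strip()
    if inner.startswith("("):
        # Parenthesised first item: a multi-column partition key.
        close = inner.index(")")
        partition = [c.strip() for c in inner[1:close].split(",") if c.strip()]
        rest = inner[close + 1:]
    else:
        # Otherwise the first column alone is the partition key.
        first, _, rest = inner.partition(",")
        partition = [first.strip()]
    clustering = [c.strip() for c in rest.split(",") if c.strip()]
    return partition, clustering


for spec in ["(country, state), unique_id",
             "(country, state, unique_id)",
             "user_id, state",
             "user_id"]:
    print(spec, "->", split_primary_key(spec))
```

For `(country, state), unique_id` the function returns country and state as the partition key and unique_id as the only clustering column, which is the layout suggested in the answer.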

How to use ORDER BY with only one primary key column in Cassandra?

I'm trying to use the ORDER BY feature of Cassandra, but with only one primary key column. When I try to create my table with the statement below, it fails.
CREATE TABLE user_classement
(
    user_name set<text>,
    score float,
    PRIMARY KEY (score)
) WITH CLUSTERING ORDER BY (score DESC);
But cassandra throws this error:
Clustering key columns must exactly match columns in CLUSTERING ORDER BY directive
If I add a second column to the primary key, it works, but with only one primary key column I get this error.
Do you know if it is possible to use ORDER BY with only one primary key column?
A primary key in Cassandra consists of a partition key and clustering keys. The first part of the primary key is the partition key, so in your example score is the partition key, and ordering can only be applied to clustering keys. If you had a primary key like PRIMARY KEY (score, rank), you could apply ordering on rank. For ordering across partitions you could try ByteOrderedPartitioner, but I have not tried it, so I cannot comment further.
Edit 1: As Aaron added in the comments, only the Murmur3 partitioner should be used. ByteOrderedPartitioner exists only for backward compatibility when upgrading from old versions.
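The idea that ordering applies to clustering keys inside a partition, not to the partition key, can be sketched in plain Python. This is a toy model under the assumption of PRIMARY KEY (score, rank) from the answer; the `build_partitions` helper and the sample rows are invented for illustration.

```python
from collections import defaultdict


def build_partitions(rows, partition_col, clustering_col, descending=False):
    """Group rows by partition key, then sort each group by the clustering column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_col]].append(row)
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda r: r[clustering_col], reverse=descending)
    return dict(partitions)


rows = [
    {"score": 9.5, "rank": 2, "user_name": "bob"},
    {"score": 9.5, "rank": 7, "user_name": "alice"},
    {"score": 3.0, "rank": 1, "user_name": "carol"},
    {"score": 9.5, "rank": 4, "user_name": "dave"},
]

# Roughly what CLUSTERING ORDER BY (rank DESC) would mean in this toy model.
partitions = build_partitions(rows, "score", "rank", descending=True)
ranks = [r["rank"] for r in partitions[9.5]]
print(ranks)
```

Rows are ordered by rank within the partition for score 9.5, while nothing in the model defines an order between the partitions themselves.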

Is it necessary to use all the columns defined as the primary key to query a Cassandra database?

I am using a Cassandra database and need to define the primary key, which is a combination of partition key and clustering keys. Per the business requirement, the database needs to be queried by the combination of two fields: a customer number and createdAt (a Unix timestamp value). These two columns alone cannot be the primary key because they do not uniquely identify a row. So, is it correct to add the uuid column as a clustering key to make the primary key unique, so that the primary key becomes customerNumber (partition key), createdAt (clustering key), uuid (clustering key)? The database will never be queried by the whole primary key, only by part of it, i.e. customerNumber and createdAt. uuid will never be used in a query.
So if I understand correctly, your PRIMARY KEY definition looks like this:
PRIMARY KEY (customerNumber,createdAt,uuid)
It will always be queried based on the part of the Primary key
Yes, querying by part of the PRIMARY KEY definition is fine, in your case. Cassandra tries to restrict queries to a single node, and it achieves this by ensuring that an entire partition is written to a single node (and then replicated). Because of this, you really only need to supply the partition key on your queries (customerNumber), and they should work.
Supplying an additional PRIMARY KEY component however, is helpful. In a high-throughput scenario, the smaller you can keep your result set payloads, the better.
tl;dr;
Querying by customerNumber and createdAt will be just fine.
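To illustrate why a leading part of the primary key is enough, the sketch below models one partition (one customerNumber) as a list of rows kept sorted by the clustering columns (createdAt, uuid). It is a toy model, not Cassandra's storage engine; the `rows_created_at` helper, the short uuid strings, and the payloads are made up.

```python
from bisect import bisect_left

# One partition for a single customerNumber, sorted by (createdAt, uuid).
partition = sorted([
    (1700000300, "c3", "order C"),
    (1700000100, "a1", "order A"),
    (1700000100, "b2", "order B"),
])


def rows_created_at(partition_rows, created_at):
    """Return all rows with the given createdAt, without knowing any uuid."""
    # (created_at,) sorts before every (created_at, uuid, ...) tuple,
    # so bisect_left lands on the first matching row, if there is one.
    start = bisect_left(partition_rows, (created_at,))
    result = []
    for row in partition_rows[start:]:
        if row[0] != created_at:
            break
        result.append(row)
    return result


print(rows_created_at(partition, 1700000100))
```

Since rows sharing a createdAt value sit next to each other in the sorted partition, the lookup returns both of them even though uuid was never supplied.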

Cassandra - Internal data storage when no clustering key is specified

I'm trying to understand the scenario when no clustering key is specified in a table definition.
If a table has only a partition key and no clustering key, in what order are the rows within the same partition stored? Is it even possible to have multiple rows in the same partition when no clustering key exists? I tried searching for it online but couldn't find a clear explanation.
I got the below explanation from Cassandra user group so posting it here in case someone else is looking for the same info:
"Note that a table always has a partition key, and that if the table has
no clustering columns, then every partition of that table is only
comprised of a single row (since the primary key uniquely identifies
rows and the primary key is equal to the partition key if there is no
clustering columns)."
http://cassandra.apache.org/doc/latest/cql/ddl.html#the-partition-key

Resources