Row key in a Cassandra table

I am new to Cassandra and confused about the difference between a row key and a partition key.
I am creating a table like this:
CREATE TABLE events (day text, hour text, dip text, sip text, count counter,
PRIMARY KEY ((day, hour), dip, sip));
As I understand it, in the table above the day and hour columns form the partition key, and the dip and sip columns form the clustering key.
I also believe the row key is just another name for the partition key, i.e. the day and hour columns form the row key.
Is my understanding correct? Can anyone clarify this?

Yes, your understanding is correct. "Row key" is the "old school" way of referring to a partition key. The partition key (as you probably understand) is the part of the CQL PRIMARY KEY that determines where the data is stored in the cluster. In your case, the rows within each partition will be sorted by dip and then sip (your clustering keys).
You should give John Berryman's article "Understanding How CQL3 Maps To Cassandra's Internal Data Structure" a read. It does a great job of explaining how your table structures map "under the hood."
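To make the mapping concrete, here is a minimal sketch of how the table from the question could be written to and read from. The literal values (date, hour, IP addresses) are made up for illustration and are not part of the original question.

```cql
-- Table from the question: (day, hour) is the partition key,
-- dip and sip are the clustering columns.
CREATE TABLE events (
    day   text,
    hour  text,
    dip   text,
    sip   text,
    count counter,
    PRIMARY KEY ((day, hour), dip, sip)
);

-- Counter columns are modified with UPDATE, not INSERT.
-- The full primary key identifies the row whose counter is incremented.
UPDATE events SET count = count + 1
WHERE day = '2015-03-01' AND hour = '13'
  AND dip = '10.0.0.2' AND sip = '10.0.0.1';

-- Both partition key columns are restricted, so this reads one partition.
-- Rows come back ordered by dip, then sip.
SELECT dip, sip, count FROM events
WHERE day = '2015-03-01' AND hour = '13';

-- A clustering column can narrow the read inside that partition.
SELECT sip, count FROM events
WHERE day = '2015-03-01' AND hour = '13' AND dip = '10.0.0.2';
```

Both SELECT statements restrict day and hour together, which is what lets Cassandra go directly to the partition holding the rows.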

Related

Should every table in Cassandra have a partition key?

I am trying to create a Cassandra table that stores the logs for a shop by timestamp. I also want a query that returns the data in descending order of timestamp. If I make the timestamp the primary key, it automatically becomes the partition key, because there are no other columns in the primary key.
And in Cassandra we can't do ORDER BY on partition keys. Is there any way to make my timestamp the primary key but not the partition key (i.e. a Cassandra table without a partition key)?
Thanks in advance.
Table definition, if required:
CREATE TABLE myCass.logs(timestamp timestamp, logs text, PRIMARY KEY (timestamp));
Since you have the timestamp, you know the year, month, and day. You could use those as your partition key and make the timestamp a clustering column. That satisfies the need for a partition key, still gives you a primary key for the data, lets you ORDER BY the timestamp, and spreads your data evenly across the cluster.
This way of splitting data is called bucketing. Here is some good reading on the subject: "Cassandra Time Series Data Modeling For Massive Scale".
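As a rough sketch of the bucketing idea, the table below uses one partition per day. The table name logs_by_day, the column names day and ts, and the sample values are assumptions made for illustration. The date type assumes a reasonably recent Cassandra version; a text column would work the same way.

```cql
-- One partition per day; rows inside a partition are stored newest first.
CREATE TABLE logs_by_day (
    day  date,        -- bucket derived from the timestamp
    ts   timestamp,   -- clustering column, so it can be ordered
    logs text,
    PRIMARY KEY ((day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

INSERT INTO logs_by_day (day, ts, logs)
VALUES ('2016-05-04', '2016-05-04 10:15:00+0000', 'shop opened');

-- Latest entries for one day, already in descending timestamp order.
SELECT ts, logs FROM logs_by_day
WHERE day = '2016-05-04'
LIMIT 10;

-- The stored order can be reversed per query when needed.
SELECT ts, logs FROM logs_by_day
WHERE day = '2016-05-04'
ORDER BY ts ASC;
```

The bucket size is a design choice: a day is used here, but a busier shop might bucket by hour to keep partitions small, at the cost of querying more partitions to cover a time range.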

Cassandra - Internal data storage when no clustering key is specified

I'm trying to understand the scenario when no clustering key is specified in a table definition.
If a table has only a partition key and no clustering key, in what order are the rows within the same partition stored? Is it even possible to have multiple rows in the same partition when no clustering key exists? I tried searching online but couldn't find a clear explanation.
I got the explanation below from the Cassandra user group, so I am posting it here in case someone else is looking for the same information:
"Note that a table always has a partition key, and that if the table has
no clustering columns, then every partition of that table is only
comprised of a single row (since the primary key uniquely identifies
rows and the primary key is equal to the partition key if there is no
clustering columns)."
http://cassandra.apache.org/doc/latest/cql/ddl.html#the-partition-key
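A small sketch of what that means in practice, using a hypothetical table users_by_id that is not part of the question. Because the primary key is only the partition key, a second write with the same key does not add a row to the partition; it overwrites the existing one.

```cql
-- No clustering columns: the primary key is just the partition key.
CREATE TABLE users_by_id (
    id   int PRIMARY KEY,
    name text
);

INSERT INTO users_by_id (id, name) VALUES (1, 'alice');

-- Same partition key, so this is an upsert of the same single row.
INSERT INTO users_by_id (id, name) VALUES (1, 'bob');

-- Returns one row for id = 1, carrying the most recently written name.
SELECT id, name FROM users_by_id WHERE id = 1;
```

With a single row per partition, ordering within a partition does not arise.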

Cassandra Defining Primary key and alternatives

Here is a simple example of a user table in Cassandra. What is the best strategy for choosing a primary key?
My requirements are:
search by uuid
search by username
search by email
All of these keys are high-cardinality. Also, at any given moment I will have only one of them to search with.
PRIMARY KEY(uid,username,email)
What if I have only the username? Then the primary key above is not useful. I can't see how to achieve this with a compound primary key.
What are the other options? Should we create a new table that maps username to uid, and then search the user table?
All the articles I have found recommend against creating a secondary index on high-cardinality columns.
CREATE TABLE medicscity.user (
uid uuid,
fname text,
lname text,
user_id text,
email_id text,
password text,
city text,
state_id int,
country_id int,
dob timestamp,
zipcode text,
PRIMARY KEY (??)
)
How do we solve this kind of situation?
Yes, you need to go with duplicate tables.
Whenever you have to query a Cassandra table by column1, column2, or column3 independently, you have to duplicate the table.
How much you duplicate is your choice.
In this example, you can duplicate the table with its full data, once per lookup column.
Or you can keep the full data only in a main table whose primary key is (column1, column2, column3) with column1 as the partition key, and add two key-only tables:
one with the same primary key columns but with column2 as the partition key,
and another with the same primary key columns but with column3 as the partition key.
That keeps the duplicated data small, but you end up querying twice: once against the key-only table, and once against the full table.
Big data technology exists to speed up computation and let your system scale horizontally, and that comes at the expense of disk space. Even replication, which Cassandra is built on, duplicates data.
Your PRIMARY KEY (uid, username, email) doesn't fit your requirement, because you can't search on a clustering column without specifying the partition key, or on the second clustering column without specifying the first.
E.g. you cannot search for username without uid in the WHERE clause, and you cannot search for email without both uid and username.
What you need is denormalization and duplicated data.
Denormalization and duplication of data is a fact of life with Cassandra. Don't be afraid of it. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPs, or network), and Cassandra is architected around that fact. In order to get the most efficient reads, you often need to duplicate data.
In your case, you need to create 3 tables that hold the same columns (the data you want to read) but have different primary keys: one with uid as the primary key, one with username, and one with email. :)
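A minimal sketch of that three-table layout, assuming the user_id column from the question holds the username and email_id holds the email. The table names, the reduced column list, and the sample values are illustrative only.

```cql
-- Same data, three different partition keys, one per lookup path.
CREATE TABLE user_by_uid (
    uid      uuid PRIMARY KEY,
    user_id  text,
    email_id text,
    fname    text,
    lname    text
);

CREATE TABLE user_by_username (
    user_id  text PRIMARY KEY,
    uid      uuid,
    email_id text,
    fname    text,
    lname    text
);

CREATE TABLE user_by_email (
    email_id text PRIMARY KEY,
    uid      uuid,
    user_id  text,
    fname    text,
    lname    text
);

-- Write all three copies together so the tables do not drift apart.
BEGIN BATCH
    INSERT INTO user_by_uid (uid, user_id, email_id, fname, lname)
    VALUES (123e4567-e89b-12d3-a456-426655440000, 'jdoe', 'jdoe@example.com', 'Jane', 'Doe');
    INSERT INTO user_by_username (user_id, uid, email_id, fname, lname)
    VALUES ('jdoe', 123e4567-e89b-12d3-a456-426655440000, 'jdoe@example.com', 'Jane', 'Doe');
    INSERT INTO user_by_email (email_id, uid, user_id, fname, lname)
    VALUES ('jdoe@example.com', 123e4567-e89b-12d3-a456-426655440000, 'jdoe', 'Jane', 'Doe');
APPLY BATCH;

-- Each lookup is a single-partition read against the matching table.
SELECT * FROM user_by_username WHERE user_id = 'jdoe';
SELECT * FROM user_by_email WHERE email_id = 'jdoe@example.com';
```

A logged batch is one way to keep the copies in step. It costs more than three independent writes, so whether it is worth using depends on the write volume.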

Querying Cassandra by a partial partition key

In Cassandra, I can create a composite partition key, separate from my clustering key:
CREATE TABLE footable (
column1 text,
column2 text,
column3 text,
column4 text,
PRIMARY KEY ((column1, column2))
)
As I understand it, querying by partition key is an extremely efficient (the most efficient?) method for retrieving data. What I don't know, however, is whether it's also efficient to query by only part of a composite partition key.
In MSSQL, this would be efficient, as long as components are included starting with the first (column1 instead of column2, in this example). Is this also the case in Cassandra? Is it highly efficient to query for rows based only on column1, here?
This is not the case in Cassandra, because it is not possible. Doing so will yield the following error:
Partition key part entity must be restricted since preceding part is
Check out this Cassandra 2014 SF Summit presentation from DataStax MVP Robbie Strickland titled "CQL Under the Hood." Slides 62-64 show that the complete partition key is used as the rowkey. With composite partitioning keys in Cassandra, you must query by all of the rowkey or none of it.
You can watch the complete presentation video here.
This is impossible in Cassandra because resolving such a query would require a full table scan. The location of a partition is determined by a hash of all members of the composite key, so supplying only half of the key is as good as supplying none of it. The only way to find the record would be to search through all keys and check whether they match.
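The difference can be sketched against the footable definition from the question. The alternative table footable_by_column1 and the literal values are hypothetical; it shows one possible way to support lookups by column1 alone, by moving column2 out of the partition key and into a clustering column.

```cql
-- Accepted: every column of the composite partition key is restricted,
-- so the partition can be located from the hash of (column1, column2).
SELECT * FROM footable
WHERE column1 = 'a' AND column2 = 'b';

-- Rejected as written, with the error quoted in the answer above:
-- SELECT * FROM footable WHERE column1 = 'a';

-- Possible alternative: partition on column1 only and cluster on column2.
CREATE TABLE footable_by_column1 (
    column1 text,
    column2 text,
    column3 text,
    column4 text,
    PRIMARY KEY ((column1), column2)
);

-- Now a lookup by column1 alone is a single-partition read,
-- with rows ordered by column2.
SELECT * FROM footable_by_column1
WHERE column1 = 'a';

-- column2 can still be added to narrow the read within the partition.
SELECT * FROM footable_by_column1
WHERE column1 = 'a' AND column2 = 'b';
```

The trade-off is partition size: all rows sharing a column1 value now live in one partition, so this layout only suits data where that set stays reasonably small.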

Resources