How Cassandra stores multicolumn primary key (CQL) - cassandra

I have a little misunderstanding about composite row keys with CQL in Cassandra.
Let's say I have the following
cqlsh:testcql> CREATE TABLE Note (
... key int,
... user text,
... name text
... , PRIMARY KEY (key, user)
... );
cqlsh:testcql> INSERT INTO Note (key, user, name) VALUES (1, 'user1', 'name1');
cqlsh:testcql> INSERT INTO Note (key, user, name) VALUES (1, 'user2', 'name1');
cqlsh:testcql>
cqlsh:testcql> SELECT * FROM Note;
key | user | name
-----+-------+-------
1 | user1 | name1
1 | user2 | name1
How is this data stored? Are there two rows or one?
If two, how is it possible to have more than one row with the same key?
If one, and I have records with key=1 and user from "user1" to "user1000", does that mean there will be one row with key=1 and 1000 columns containing the name for each user?
Can someone explain what's going on in the background? Thanks.

So, after digging a bit more and reading an article suggested by Lyuben Todorov (thank you), I found the answer to my question.
Cassandra stores data in data structures called rows, which are quite different from rows in relational databases. Each row has a unique key.
Now, what's happening in my example... In table Note I have a composite key defined as PRIMARY KEY (key, user). Only the first element of this key acts as the row key, and it's called the partition key. Internally, the rest of the key is used to build composite columns.
In my example
key | user | name
-----+-------+-------
1 | user1 | name1
1 | user2 | name1
This will be represented in Cassandra in one row as
-------------------------------------
| | user1:name | user2:name |
| 1 |--------------------------------
| | name1 | name1 |
-------------------------------------
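The layout above can be modeled roughly in Python. This is only a sketch of the idea, not Cassandra's actual storage format: the function name to_storage_rows and the "clustering value:column name" cell naming are assumptions made for illustration.

```python
from collections import defaultdict

def to_storage_rows(cql_rows, partition_key, clustering_key):
    """Group CQL rows into one storage row per partition key.

    Each remaining column becomes a cell named
    "<clustering value>:<column name>" (hypothetical naming).
    """
    storage = defaultdict(dict)
    for row in cql_rows:
        pk = row[partition_key]
        ck = row[clustering_key]
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue  # key columns are encoded in the row key / cell name
            storage[pk][f"{ck}:{column}"] = value
    return dict(storage)

# The two CQL rows from the example table Note
cql_rows = [
    {"key": 1, "user": "user1", "name": "name1"},
    {"key": 1, "user": "user2", "name": "name1"},
]

storage = to_storage_rows(cql_rows, "key", "user")
print(storage)
# {1: {'user1:name': 'name1', 'user2:name': 'name1'}}
```

Both CQL rows end up as cells of the single storage row with key 1, which is why a thousand users under key=1 would mean a thousand cells in that one row.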
Knowing that, it's clear that it's not a good idea to add a column with a huge (and growing) number of unique values to the composite key, because all of those values will be stored in one row. It's even worse if you have multiple columns like this in a composite primary key.
Update: Later I found this blog post by Aaron Morton that explains the same thing in more detail.

Related

Cassandra find records where list is empty [duplicate]

How do I query in Cassandra for columns that are not null?
Select * from tableA where id != null;
Select * from tableA where name != null;
I then want to store these values and insert them into a different table.
I don't think this is possible with Cassandra. First of all, Cassandra CQL doesn't support the use of NOT or not equal to operators in the WHERE clause. Secondly, your WHERE clause can only contain primary key columns, and primary key columns will not allow null values to be inserted. I wasn't sure about secondary indexes though, so I ran this quick test:
create table nullTest (id text PRIMARY KEY, name text);
INSERT INTO nullTest (id,name) VALUES ('1','bob');
INSERT INTO nullTest (id,name) VALUES ('2',null);
I now have a table and two rows (one with null data):
SELECT * FROM nullTest;
id | name
----+------
2 | null
1 | bob
(2 rows)
I then try to create a secondary index on name, which I know contains null values.
CREATE INDEX nullTestIdx ON nullTest(name);
It lets me do it. Now, I'll run a query on that index.
SELECT * FROM nullTest WHERE name=null;
Bad Request: Unsupported null value for indexed column name
The premise here is that if you can't even query for column values that are actually null, you certainly can't query for values that are not null.
So, I'm thinking this can't be done. Also, if null values are a possibility in your primary key, then you may want to re-evaluate your data model. Again, I know the OP's question is about querying where data is not null. But as I mentioned before, Cassandra CQL doesn't have a NOT or != operator, so that's going to be a problem right there.
Another option is to insert an empty string instead of a null. You would then be able to query on an empty string. But that still doesn't get you past the fundamental design flaw of having a null in a primary key field. It might work if you had a composite primary key and only part of it (the clustering columns) could be empty (certainly not part of the partitioning key). But you'd still be stuck with the problem of not being able to query for rows that are "not empty" (instead of not null).
NOTE: Inserting null values was done here for demonstration purposes only. It is something you should do your best to avoid, as inserting a null column value WILL create a tombstone. Likewise, inserting lots of null values will create lots of tombstones.
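Since CQL cannot express the "not null" condition, one workaround (not part of the answer above, and only reasonable for small tables because it reads every row) is to filter on the client. A minimal sketch, assuming the rows have already been fetched into a list of dicts; the helper name with_value is hypothetical:

```python
# Rows as they might come back from "SELECT * FROM nullTest",
# represented here as plain dicts (no driver involved).
rows = [
    {"id": "2", "name": None},
    {"id": "1", "name": "bob"},
]

def with_value(rows, column):
    """Keep only the rows where the given column is set (not None)."""
    return [row for row in rows if row.get(column) is not None]

non_null = with_value(rows, "name")
print(non_null)
# [{'id': '1', 'name': 'bob'}]
```

The filtered rows could then be written to the other table with ordinary INSERT statements.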
1) select * from test;
name | id | address
------------------+----+------------------
bangalore | 3 | ramyam_lab
bangalore | 4 | bangalore_ramyam
bangalore | 5 | jasgdjgkj
prasad | 11 | null
prasad | 12 | null
india | 6 | karnata
india | 7 | karnata
ramyam-bangalore | 3 | jasgdjgkj
ramyam-bangalore | 5 | jasgdjgkj
2) Cassandra doesn't support selecting on null values. The null shown above is only displayed for our understanding.
3) To handle null values, store a placeholder string such as "not-available" or "null" instead; then you can select on that value.

Cassandra Token or Hash Values

I have a student table in Cassandra with column named StudentId as primary key. Can two values from this column have same token/hash value?
Table structure
|-----------|-------------|
| StudentId | Primary Key |
| FName | |
| LName | |
|-----------|-------------|
So I think that I get what you're trying to ask here. When determining data distribution, the partition key (first part of the PRIMARY KEY) is hashed to obtain a token. That row is then written to the node(s) responsible for that particular token range.
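As for having the same hash value, it is important to note that PRIMARY KEYs in Cassandra are unique. Two different partition keys could in theory hash to the same token, but with a 64-bit token space that is extremely unlikely, and Cassandra still tells such keys apart by the key itself. In practice, for two rows in this table to share a token they would need identical partition keys, and since studentid is the entire PRIMARY KEY, the second write would simply overwrite the first.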
To demonstrate this, I have re-created your table and INSERTed a few rows:
CREATE TABLE student (
studentid TEXT PRIMARY KEY,
fname TEXT,
lname TEXT);
INSERT INTO student (studentid, fname, lname) VALUES ('aploetz','Aaron','Ploetz');
INSERT INTO student (studentid, fname, lname) VALUES ('aploetz','Avery','Ploetz');
INSERT INTO student (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
INSERT INTO student (studentid, fname, lname) VALUES ('mgin','Micah','Gin');
Now I will query that table, utilizing the token function on the partition key (studentid):
SELECT token(studentid),studentid,fname,lname FROM student ;
system.token(studentid) | studentid | fname | lname
-------------------------+-----------+-------+----------
-5626264886876159064 | janderson | Jordy | Anderson
-1472930629430174260 | aploetz | Avery | Ploetz
8993000853088610283 | mgin | Micah | Gin
(3 rows)
Notes:
In using the token function on the partition key, I can see the hashed token values, thus I can determine which node(s) in the cluster will contain this data.
The first two students I inserted had different names, but they ended up with the same studentid of aploetz. As PRIMARY KEYs are unique, only one persisted.
The row for Avery Ploetz "won," as it was written last.
Let me know if you require any further explanation, but I hope this helps to answer your question.
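The two points in this answer (a token decides which node owns a key, and a repeated primary key overwrites) can be sketched in Python. Everything here is an assumption for illustration: the hash is an MD5-based stand-in rather than Cassandra's Murmur3 partitioner, so the tokens will not match the cqlsh output above, and the three-node ring is made up.

```python
import bisect
import hashlib

def token(partition_key: str) -> int:
    """Stand-in token: first 8 bytes of MD5 as a signed 64-bit integer.
    This is NOT the hash Cassandra's Murmur3Partitioner uses."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Hypothetical 3-node ring: each node owns tokens up to and including its own.
ring = [
    (-3074457345618258603, "node1"),
    (3074457345618258602, "node2"),
    (9223372036854775807, "node3"),
]
ring_tokens = [t for t, _ in ring]

def owner(partition_key: str) -> str:
    """Find the first node whose token is >= the key's token."""
    idx = bisect.bisect_left(ring_tokens, token(partition_key))
    return ring[idx % len(ring)][1]

# A table keyed by its primary key: a second insert with the same key overwrites.
table = {}

def insert(studentid, fname, lname):
    table[studentid] = {"fname": fname, "lname": lname}

insert("aploetz", "Aaron", "Ploetz")
insert("aploetz", "Avery", "Ploetz")
insert("janderson", "Jordy", "Anderson")
insert("mgin", "Micah", "Gin")

for studentid, row in table.items():
    print(token(studentid), owner(studentid), studentid, row["fname"], row["lname"])
```

Four inserts leave three rows, and the row for aploetz holds the values written last.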

Usage of cqlsh is similar with mysql, what's the difference?

cqlsh create table:
CREATE TABLE emp(
emp_id int PRIMARY KEY,
emp_name text,
emp_city text,
emp_sal varint,
emp_phone varint
);
insert data
INSERT INTO emp (emp_id, emp_name, emp_city,
emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000);
select data
SELECT * FROM emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000
3 | Chennai | rahman | 9848022330 | 45000
This looks just the same as MySQL. Where are the column family and the column?
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
A column is the basic data structure of Cassandra with three values, namely key or column name, value, and a time stamp.
So is table emp a column family?
Is INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000); a row which contains columns?
Is a column here something like emp_id=>1 or emp_name=>ram?
In Cassandra, although the column families are defined, the columns are not. You can freely add any column to any column family at any time.
What does this mean?
Can I have something like this?
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000 | asdfasd | asdfasdf
3 | Chennai | rahman | 9848022330 | 45000
A super column is a special column, therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
Where is the super column, and how do I create one?
Column family is an old name; now it's just called a table.
Super column is also an old term. Today you have the map data type, for example, or user-defined types for more complex structures.
About freely adding columns: in the old days, Cassandra worked with an unstructured data paradigm, so you didn't have to define columns before inserting them. That is no longer possible, since the Cassandra team moved to "structured" only (as many in the database industry came to the conclusion that unstructured data creates more problems than it is worth).
Anyway, Cassandra's data representation at the storage level is very different from MySQL's, and it indeed saves data only for the columns that aren't empty. A row may look the same when you run a select from cqlsh, but it is stored and queried in a very different way.
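The point about saving data only for non-empty columns can be pictured with a small Python model. It is only a sketch: the dict layout and the helper as_cql_row are hypothetical, and the second employee with a single written column is made up for the example.

```python
# Column names declared in the table definition
COLUMNS = ["emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal"]

# What is "stored": only the cells that were actually written.
stored = {
    1: {"emp_name": "ram", "emp_city": "Hyderabad",
        "emp_phone": 9848022338, "emp_sal": 50000},
    2: {"emp_name": "robin"},  # hypothetical row with one column written
}

def as_cql_row(emp_id):
    """Render a stored row the way a SELECT would show it:
    every declared column appears, missing cells show up as None (null)."""
    cells = stored[emp_id]
    row = {"emp_id": emp_id}
    for column in COLUMNS:
        if column != "emp_id":
            row[column] = cells.get(column)
    return row

print(as_cql_row(1))
print(as_cql_row(2))
```

The second row prints with None for city, phone and salary, even though nothing is stored for those columns.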
Column family is an old term for what's now simply called a table, such as "emp" in your example. Each table contains one or more columns, such as "emp_id" and "emp_name".
Being able to freely add columns at any time means that you can always omit values for columns (they will be null) or add columns using the ALTER TABLE statement.

Duplicate rows/columns for the same primary key in Cassandra

I have a table/columnfamily in Cassandra 3.7 with sensordata.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
)
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using java and Datastax java driver, and it has been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting with the legacy cassandra-cli to this cluster, perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there any way this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically, the write time of these two rows. The writetime() function can't operate on primary key columns or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, from comments, answers and other sources
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling thrift, is possible but reading the table isn't. It's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
I don't expect to see nanoseconds in a timestamp field, and I'm under the impression that they aren't supported at all. Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;
I WAS able to replicate it by inserting the rows with the timestamp given as an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense, because I suspect one of your drivers is using a bigint to insert the timestamp while the other is using an actual datetime.
I tried playing with both timezones and bigints to reproduce this; it seems only the bigint case is reproducible.
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
edit: Tried some shenanigans using bigint in place of datetime insert, managed to reproduce...
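Why the two inserted values look like duplicates can be checked with plain Python, using the two millisecond values from the INSERT statements above. The render helper is hypothetical and simply formats to whole seconds, as the cqlsh output in this answer does.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def render(ms: int) -> str:
    """Format milliseconds since the epoch with second precision."""
    return (EPOCH + timedelta(milliseconds=ms)).strftime("%Y-%m-%d %H:%M:%S%z")

a = 1451692800000  # first inserted sensor_time
b = 1451692800001  # second one, 1 ms later

print(render(a))   # 2016-01-02 00:00:00+0000
print(render(b))   # 2016-01-02 00:00:00+0000
print(a == b)      # False
```

The clustering values differ, so they are two distinct rows, but they are displayed identically once the milliseconds are dropped.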
Adding some observations on top of what Nick mentioned:
Cassandra Primary key = one or combination of {Partition key(s) + Clustering key(s)}
Keep in mind that the partition key, written inside parentheses, can be simple (one column) or composite (multiple columns) and is used for unique identification, while clustering keys are used to sort data. With that in mind, the following has been observed.
Query using select: it is sufficient to restrict all of the partition key column(s); you can additionally restrict clustering key(s), but only in the order in which they were declared in the primary key at table creation.
Update using set or update: the update statement's WHERE clause must include not only all the partition key column(s) but also all the clustering key(s).
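The two rules above can be written down as a small Python check. This is a simplified sketch of the rules as stated here; it ignores secondary indexes and ALLOW FILTERING, and the function names and the extra clustering column reading_id are hypothetical.

```python
def can_select(partition_keys, clustering_keys, restricted):
    """All partition key columns must be restricted; clustering columns
    may be restricted only as a prefix of their declared order."""
    restricted = set(restricted)
    if not set(partition_keys) <= restricted:
        return False
    used = [c for c in clustering_keys if c in restricted]
    return used == clustering_keys[:len(used)]

def can_update(partition_keys, clustering_keys, restricted):
    """An update must restrict every partition and clustering column."""
    return set(partition_keys) | set(clustering_keys) <= set(restricted)

PK = ["house_id", "sensor_id", "time_bucket"]
CK = ["sensor_time"]

print(can_select(PK, CK, PK))                 # True
print(can_select(PK, CK, ["house_id"]))       # False
print(can_update(PK, CK, PK))                 # False
print(can_update(PK, CK, PK + CK))            # True

# With a hypothetical second clustering column, skipping the first is rejected.
CK2 = ["sensor_time", "reading_id"]
print(can_select(PK, CK2, PK + ["reading_id"]))  # False
```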
Answering the question: is there any way this could happen under known circumstances?
Yes, it is possible when the same data is inserted from different sources.
To explain further, in case one inserts data into Cassandra from code (an API, etc.) and then inserts the same data from DataStax Studio or any other tool used for direct querying, a duplicate record is inserted.
In case the same data is pushed multiple times from code alone, from a querying tool alone, or from any other single source, the write behaves idempotently and the data is not inserted again.
A possible explanation is the way the underlying storage engine computes internal indexes or hashes to identify a row for a set of columns (since it is column based).
Note:
The duplication described above, when the same data is pushed from different sources, has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3
"sensor_time" is part of the primary key. It is not in the "Partition Key", but it is a "Clustering Column". This is why you get two "rows".
However, on disk, both "visual rows" are stored in a single Cassandra row. In reality they are just different columns, and CQL just pretends they are two "visual rows".
Clarification: I haven't worked with Cassandra for a while, so I might not be using the correct terms. When I say "visual rows", I mean what the CQL result shows.
Update
You can try the following experiment (please ignore and fix any syntax errors I make).
This is supposed to create a table with a composite primary key:
"state" is "Partition Key" and
"city" is "Clustering Column".
create table cities(
state int,
city int,
name text,
primary key((state), city)
);
insert into cities(state, city, name)values(1, 1, 'New York');
insert into cities(state, city, name)values(1, 2, 'Corona');
select * from cities where state = 1;
this will return something like:
1, 1, New York
1, 2, Corona
But on the disk this will be stored on single row like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such a composite primary key, you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
this means
"house_id", "sensor_id", "time_bucket" is "Partition Key" and
"sensor_time" is the "Clustering Column".
So when you select, the real row is split up and shown as if there were several rows.
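The splitting of one stored row into several result rows can be sketched in Python for the cities experiment. This only models the idea; the (city, column name) cell keys and the function split_into_cql_rows are assumptions for illustration, not the real on-disk format.

```python
# Hypothetical storage row for state = 1:
# each cell is keyed by (clustering value, column name).
storage_row = {
    (1, "name"): "New York",
    (2, "name"): "Corona",
}

def split_into_cql_rows(state, storage_row):
    """Turn one wide storage row into one result row per clustering value."""
    rows = {}
    for (city, column), value in sorted(storage_row.items()):
        row = rows.setdefault(city, {"state": state, "city": city})
        row[column] = value
    return list(rows.values())

cql_rows = split_into_cql_rows(1, storage_row)
for row in cql_rows:
    print(row)
# {'state': 1, 'city': 1, 'name': 'New York'}
# {'state': 1, 'city': 2, 'name': 'Corona'}
```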
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".

Cassandra: can you add dynamic columns within existing column clustering?

I'm using Cassandra 1.2.12 with CQL 3, and am having trouble modeling my column family.
I currently store snapshots of customer data at particular times. Works great:
CREATE TABLE data (
cust_id varchar,
time timeuuid,
data_text text,
PRIMARY KEY (cust_id, time)
);
The cust_id is the partition key and time is the clustering id, so, as I understand it, I can think of each row in the table like:
| cust_id | timeuuid1 : data_text | timeuuid2 : data_text |
| CUST1 | data at this time | data at this time |
Now I'd like to store another group of metrics for each snapshot - but the name of each of these columns isn't fixed. So something like:
| cust_id | timeuuid1 : data_text | timeuuid1 : dynamicCol1 | timeuuid1 : dynamicCol2 | timeuuid1 : dynamicColN |
| CUST1 | data |{some value} |{some value} |{some value} |
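The layout sketched above can be modeled in Python as one partition per customer whose cells are keyed by (time, column name). This is only an illustration of the desired shape, not a CQL schema; the placeholder strings stand in for real timeuuid values and the helper snapshots is hypothetical.

```python
from collections import defaultdict

# Cells of the partition for CUST1, keyed by (time id, column name).
# "timeuuid1" / "timeuuid2" are placeholders for real timeuuid values.
partition = {
    ("timeuuid1", "data_text"): "data",
    ("timeuuid1", "dynamicCol1"): "some value",
    ("timeuuid1", "dynamicCol2"): "some value",
    ("timeuuid2", "data_text"): "data at this time",
}

def snapshots(partition):
    """Group the cells by their time prefix, giving one dict per snapshot."""
    grouped = defaultdict(dict)
    for (time_id, column), value in partition.items():
        grouped[time_id][column] = value
    return dict(grouped)

by_time = snapshots(partition)
print(sorted(by_time["timeuuid1"]))  # ['data_text', 'dynamicCol1', 'dynamicCol2']
print(sorted(by_time["timeuuid2"]))  # ['data_text']
```

Each snapshot carries its own set of column names, which is the "dynamic columns within a cluster of columns" shape being asked about.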
I've achieved dynamic columns for timestamp by using a composite primary key, but I can't see how to achieve this within each cluster of columns, if you see what I mean.
If I add, say, "dynamicColumnName" to the existing composite key, I'll end up with customer data stored for each dynamic column, which is not what I want.
Is this possible, without using a Map column? Hope you can help, thanks!
I am not a CQL user... With the thrift API you dynamically add a column to a column family by inserting/updating a record with a value for a column with name X. The column X will start to exist right there and then for that record.
Have you tried an INSERT statement specifying a column that you have not explicitly defined? I would expect that to have the same effect (column is created).

Resources