Cassandra - What is meant by - "cannot rename non primary key part" - cassandra

I have created a table users as follows:
create table users (user_id text primary key, email text, first_name text, last_name text, session_token int);
I am referring to the CQL help documentation on the DataStax website.
I now want to rename the email column to "emails". But I when I execute the command -
alter table users rename email to emails;
I am getting the error -
Bad Request: cannot rename non primary key part email
I am using CQL 3 . My CQLSH is 3.1.6 and C* is 1.2.8.
Why cannot I rename the above column? If I run help alter table, it shows the option to rename the column. How do I rename the column?

In CQL, you can rename the column used as the primary key, but not any others. This seems opposite from what it should be, one would think that the primary key would need to stay the same and the others would be easy to change! The reason comes from implementation details.
The name of the primary key is not written into each row, rather it is stored in a different place that's easily changeable. But for non-primary key fields, the names of the fields are written into each row. In order to rename the column, the system would have to rewrite every single row.
This article has some fantastic examples and a much longer discussion of Cassandra's internals.
To borrow an example directly from the article, consider this example column family:
cqlsh:test> CREATE TABLE example (
... field1 int PRIMARY KEY,
... field2 int,
... field3 int);
Insert a little data:
cqlsh:test> INSERT INTO example (field1, field2, field3) VALUES ( 1,2,3);
And then the Cassandra-CLI output (not CQLSH) from querying this column family:
[default#test] list example;
-------------------
RowKey: 1
=> (column=, value=, timestamp=1374546754299000)
=> (column=field2, value=00000002, timestamp=1374546754299000)
=> (column=field3, value=00000003, timestamp=1374546754299000)
The name of the primary key, "field1" is not stored in any of the rows, but "field2" and "field3" are written out, so changing those names would require rewriting every row.
So if you really still want to rename a non-primary column, there are basically two different strategies and neither of them are very desirable.
Drop the column and add it back, as another poster mentioned. This has the big downside of dropping all the data in that column.
or
Create a new column family that is basically a copy of the old but with the column in question renamed and rewrite your data there. This is, of course, very computationally expensive.

In order to RENAME the field, the only way I got it working was dropping the field first and then adding it in. So it is like this:
alter table users drop email;
alter table users add emails text;

The main purpose of the RENAME clause is to change the names of CQL 3-generated primary key and column names that are missing from a legacy table (table created with COMPACT STORAGE).

Related

Cassandra select CQL: Cannot add column after wildcard

I need to output the write timestamp as part of a table export for lots of tables, though I quite cannot figure out a way which does not force me to explicitely select all columns in the statement.
Instead of being able to do just this:
SELECT *, writetime(data) AS timestamp FROM dls.licenses;
I have to do that:
SELECT column1, column2, ... , writetime(data) AS timestamp FROM dls.licenses;
This is pretty unconvenient since it means I'd have to change the export tool every time the schema of any of the tables changes.
Is there a better way?
Edit: To clarify, the actual error I get is the following. The way the syntax is presented in the error one could think that the SQL should be ok:
SELECT *, writetime(id) AS timestamp FROM dls.licenses;
SyntaxException: line 1:8 mismatched input ',' expecting K_FROM (SELECT *[,]...)
Edit 2: Here is the keyspace and create statement used for this table:
CREATE KEYSPACE IF NOT EXISTS dls WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': ‚1‘ };
CREATE TABLE IF NOT EXISTS dls.licenses (subscription_id text, id text, key text, data text, PRIMARY KEY (key));
CREATE INDEX IF NOT EXISTS ON dls.licenses (id);
BTW: I'm using the fresh Cassandra 4.0.0 (GA).
If you are exporting to CSV or JSON files, you may consider using DataStax's dsbulk.
https://github.com/datastax/dsbulk
The latest version of dsbulk 1.8.0 added support to export writetime and ttl.
https://docs.datastax.com/en/dsbulk/doc/dsbulk/reference/schemaOptions.html#schemaOptions__schemaOptionsPreserveTimestamp
dsbulk unload -url myData.csv -k ks1 -t table1 --timestamp
The WHERE clause specifies which rows must be queried. It is composed of relations on the columns that are part of the PRIMARY KEY and/or have a secondary index defined on them.
The column specification of the relation must be one of the following:
One or more members of the partition key of the table
A clustering column, only if the relation is preceded by other relations that specify all columns in the partition key
A column that is indexed using CREATE INDEX.
In Cassandra 3.6 and later, add ALLOW FILTERING to filter only on a non-indexed cluster column.
You may be able to solve your query problem by creating a secondary index on the column you want the writetime for. Keep in mind secondary indexes create overhead and which may result in unintended consequences.
The star (*) in SELECT * is the CQL syntax for "ALL columns" so by definition, it is not possible to include another column since ALL of them are selected even for native CQL functions. For this reason, you need to enumerate all column names + functions-on-columns.
+1 to Yuki's answer. I wanted to add that DSBulk adds a WRITETIME() column for every column in the table because it isn't possible to know in advance the write-time of each column in the partition until the full partition has been read.
Allow me to explain it using a couple of examples.
Schema
Consider this table:
CREATE TABLE users_by_email (
email text,
name text,
address text,
mobile text,
PRIMARY KEY (email)
)
Example 1
If we add a new record with a value specified for all columns:
INSERT INTO users_by_email (email, name, address, mobile)
VALUES ('alice#staysafe.com', 'Alice', '221B Baker St', '098-765-432-109');
then for this partition, all columns will have the same write-time.
Example 2
Consider a situation where a record is fragmented across multiple inserts over a period of time such as:
INSERT INTO users_by_email (email, name) VALUES ('dude#getvaccinated.now', 'Bob');
INSERT INTO users_by_email (email, address) VALUES ('dude#getvaccinated.now', '350 Fifth Ave');
INSERT INTO users_by_email (email, mobile) VALUES ('dude#getvaccinated.now', '012-555-123-456');
Each of the columns name, address and mobile would all have different write-times.
From these 2 examples, you should see that there isn't always a single write-time that applies to all columns in the partition.
For your specific use case, you need to figure out from the DSBulk output which write-time to use for situations where the partition fragments are inserted/updated at different times. Cheers!

In Cassandra, why dropping a column from tables defined with compact storage not allowed?

As per datastx documentation here, we cannot delete column from tables defined with COMPACT STORAGE option. What is the reason for this?
This goes back to the original implementation of CQL3, and changes which were made to allow it to abstract a "SQL-like," wide-row structure on top of the original Thrift-based storage engine. Ultimately, managing the schema comes down to whether or not the underlying structure is a table or a column_family.
As an example, I'll create two tables using an old install of Apache Cassandra (2.1.19):
CREATE TABLE student (
studentid TEXT PRIMARY KEY,
fname TEXT,
name TEXT);
CREATE TABLE studentcomp (
studentid TEXT PRIMARY KEY,
fname TEXT,
name TEXT)
WITH COMPACT STORAGE;
I'll insert one row into each table:
INSERT INTO student (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
INSERT INTO studentcomp (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
And then I'll look at the tables with the old cassandra-cli tool:
[default#stackoverflow] list student;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=, value=, timestamp=1599248215128672)
=> (name=fname, value=4a6f726479, timestamp=1599248215128672)
=> (name=lname, value=416e646572736f6e, timestamp=1599248215128672)
[default#stackoverflow] list studentcomp;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=fname, value=Jordy, timestamp=1599248302715066)
=> (name=lname, value=Anderson, timestamp=1599248302715066)
Do you see the empty/"ghost" column value in the first result? That empty column value was CQL3's link between the column values and the table's meta data. If it's not there, then CQL cannot be used to manage a table's columns.
The comparator used for type conversion was all that was really exposed via Thrift. This lack of meta data control/exposure is what allowed Cassandra to be considered "schemaless" in the pre-CQL days. If I run a describe studentcomp from within the cassandra-cli, I can see the comparators (validation class) used:
Column Metadata:
Column Name: lname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Column Name: fname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
But if I try describe student, I see this:
WARNING: CQL3 tables are intentionally omitted from 'describe' output.
See https://issues.apache.org/jira/browse/CASSANDRA-4377 for details.
Sorry, no Keyspace nor (non-CQL3) ColumnFamily was found with name: student (if this is a CQL3 table, you should use cqlsh instead)
Bascially, tables and column families were different entities forced into the same bucket. Adding WITH COMPACT STORAGE essentially made a table a column family.
With that came the lack of any schema management (adding or removing columns), outside of access to the comparators.
Edit 20200905
Can we somehow / someway (hack) drop the columns from table?
You might be able to accomplish this. Sylvain Lebresne wrote A Thrift to CQL3 Upgrade Guide which will have some necessary details for you. I also advise reading through the Jira ticket mentioned above (CASSANDRA-4377), as that covers many of the in-depth technical challenges that make this difficult.

How is denormalization handled in cassandra

What is the best approach to update table with duplicate data?
I have a table
table users (
id text PRIMARY KEY,
email text,
description,
salary
)
I will delete, update, insert etc to this table. But I also have a requirement to be able to search by email, and description. If I create new table with new composite keys for email, and description,
when I update my base table I do
insert into users (id, salary) values (1, 500);
I do not have the required data to also update my secondary table since all the client has is id and salary. How is the second table updated.
Other workarounds and shortcomings
I could have created a materialized view, but since the base table has only one primary key I can only add one more column. my search requirement requires more than one column.
Create secondary indexes on the columns that will be searched on. But the performance for this would be bad since the columns I will be searching on would have high cardinality. i.e. description, email, etc
So, the "correct" way of doing this is to create 3 tables. salary_by_id, salary_by_email and salary_by_description.
table salary_by_id (
id text PRIMARY KEY,
salary int
)
table salary_by_email (
email text PRIMARY KEY,
salary int
)
table salary_by_description (
description text,
id int,
salary int,
primary key (description, id)
)
The reason i added id to salary_by_description is that, from my own guessing, description won't be globally uniq, so it has to have something else in it's primary key.
Depending on the size of these tables the last one might need something extra added to it's partitioning key. And if needed you can add id, email and description to the other tables.
Now, when inserting or deleting values you need so do it in all 3 tables. If you use a driver, like in java, that supports asynchronous calls, then this doesn't cost very much extra.

How to change PARTITION KEY column in Cassandra?

Suppose we have such table:
create table users (
id text,
roles set<text>,
PRIMARY KEY ((id))
);
I want all the values of this table to be stored on the same Cassandra node (OK, not really the same, same 3, but have all the data mirrored, but you got the point), so to achieve that i want to change this table to be like this:
create table users_v2 (
partition int,
id text,
roles set<text>,
PRIMARY KEY ((partition), id)
);
How can i do that without losing the data from the first table?
It seems to be impossible to ALTER TABLE in order to add such column. i'm OK with that.
What i try to do is to copy data from the first table and insert to the second table.
When i do it as it is, the partition column іs missing, which is expected.
I can ALTER the first table and add a 'partition' column to the end, and then COPY in correct order, but i can't update all the rows in the first table to set the all some partition, and it seems to be no "default" value when column is added.
You simply cannot alter the primary key of a Cassandra table. You need to create another table with your new schema and perform a data migration. I would suggest that you use Spark for that since it is really easy to do a migration between two tables with only a few lines of code.
This also answer to the alter primary key question.
If you have not a lot of data in table there is another way.
In utility "DataStax Dev Center", select table and use command "Export All result to file as INSERT". It will save all data from table to file with Insert CQL-instructions.
Then you should drop table, create new one with new PARTITION KEY and finally fill it by instructions from file via CQL.

How to make Cassandra have a varying column key for a specific row key?

I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in cassandra for a given row key. Is that true? And if its true, how do you allow for varying row keys.
The reason I think this might be true is because say we have a user and it can like many items and we simply want the userId to be the rowkey. We let this rowKey (userID) map to all the items that specific user might like. Each specific user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID each user likes, then we could solve the problem that way.
Therefore, is it possible to have varying length of cassandra column keys for a specific rowKey? (and how do you do it)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files and they define keyspaces before hand and it seems pretty inflexible on how to make it dynamic, i.e. allow it to have additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
How can this even allow growing columns? Don't we need to specify the name before hand anyway?Or additional custom columns as the application desires?
Yes, you can have a varying number of columns per row_key. From a relational perspective, it's not obvious that tid is the name of a variable. It acts as a placeholder for the variable column key. Note in the inserts statements below, "tid", "result", and "data" are never mentioned in the statement.
CREATE TABLE IF NOT EXISTS results (
data blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
So in your example, you need to identify the row_key, column_key, and payload of the table.
The primary key contains both the row_key and column_key.
Test is your row_key.
tid is your column_key.
data is your payload.
The following inserts are all valid:
INSERT your_keyspace.results('row_key_1', 'a4a70900-24e1-11df-8924-001ff3591711', 'blob_1');
INSERT your_keyspace.results('row_key_1', 'a4a70900-24e1-11df-8924-001ff3591712', 'blob_2');
#notice that the column_key changed but the row_key remained the same
INSERT your_keyspace.results('row_key_2', 'a4a70900-24e1-11df-8924-001ff3591711', 'blob_3');
See here
Did you thought of exploring collection support in cassandra for handling such relations in colocated way{e.g. on same data node}.
Not sure if it helps, but what about keeping user id as row key and a map containing item id as key and some value?
-Vivel

Resources