There is a table in Cassandra:
create table test_moments(id Text, title Text, sort int, PRIMARY KEY(id));
How do I add a clustering key on the column "sort" without re-creating the table?
The main problem is the on-disk data structure. The clustering key directly dictates how data is sorted and serialized to disk (and then searched), so what you're asking is not possible.
The only way is to "migrate" the data to another table. Depending on your data, if you have a lot of records you could encounter timeout errors during the queries, so be prepared to tweak your migration with some useful techniques such as the COPY command or the TOKEN function.
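For modest data volumes, a minimal sketch of such a migration using cqlsh's COPY command (the table name test_moments_v2 is an assumption):
create table test_moments_v2(id Text, title Text, sort int, PRIMARY KEY(id, sort));
-- export the existing rows, then re-import them into the new table (cqlsh only):
COPY test_moments(id, title, sort) TO 'moments.csv';
COPY test_moments_v2(id, title, sort) FROM 'moments.csv';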
All you need to do is add it as the second part of the PRIMARY KEY to make it a composite key:
create table test_moments(id Text, title Text, sort int, PRIMARY KEY(id, sort));
The schema I am using is as follows:
CREATE TABLE mytable(
id int,
name varchar,
PRIMARY KEY ((id),name)
) WITH CLUSTERING ORDER BY (name desc);
I wanted to delete records with the following command:
DELETE FROM mytable WHERE name = 'Jhon';
But it gives this error:
[Invalid query] message="Some partition key parts are missing: name"
As I looked for the reason, I came to know that a delete is not possible using only the clustering columns.
Then I tried
DELETE FROM mytable WHERE id IN (SELECT id FROM mytable WHERE name='Jhon') AND name = 'Jhon';
But obviously it did not work.
I then tried setting a TTL of 0 to delete the row. But a TTL can be set only for a particular column, not the entire row.
What are feasible alternates to perform this operation?
In Cassandra, you need to design your data model to support your queries. When you query your data, you always have to provide the partition key (otherwise the query would be inefficient).
The problem is that you want to query your data without a partition key. You would need to denormalize your data to support this kind of request. For example, you could add an additional table, such as:
CREATE TABLE id_by_name(
name varchar,
id int,
PRIMARY KEY (name, id)
) WITH CLUSTERING ORDER BY (id desc);
Then, you would be able to do your delete with a few queries:
SELECT id FROM id_by_name WHERE name='Jhon';
Let's assume this returns 4.
DELETE FROM mytable WHERE id=4;
DELETE FROM id_by_name WHERE name='Jhon' AND id=4;
You could try to leverage a materialized view (instead of maintaining id_by_name yourself), but materialized views are currently marked as unstable.
Now, there are still a few issues you need to address in your data model, in particular how you handle multiple users with the same name, etc.
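If several ids do share a name, the same pattern applies per id, and a logged batch keeps the two tables consistent. A sketch, assuming the SELECT above returned the ids 4 and 7:
BEGIN BATCH
DELETE FROM mytable WHERE id=4 AND name='Jhon';
DELETE FROM id_by_name WHERE name='Jhon' AND id=4;
DELETE FROM mytable WHERE id=7 AND name='Jhon';
DELETE FROM id_by_name WHERE name='Jhon' AND id=7;
APPLY BATCH;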
You cannot delete by an incomplete primary key. Primary key decisions drive sharding and load balancing, and Cassandra can get complex if you are not used to thinking in columns.
I don't like the above answer; though it is good, it complicates your solution. If you are thinking relationally but getting lost in Cassandra, I suggest using something that simplifies and maps your thinking to relational views.
I have some data in Cassandra, say:
create table MyTable (
id text PRIMARY KEY,
data text,
updated_on timestamp
);
My application, in addition to querying this data by the primary key id, needs to query it by the updated_on timestamp as well. To fulfil the query-by-time use case I have tried the following.
create table MyTable (
id text PRIMARY KEY,
data text,
updated_on timestamp,
updated_on_minute timestamp
);
A secondary index on the updated_on_minute field. As I understand, secondary indexes are not recommended for high-cardinality cases (which is my case, because I could have a lot of data at the same minute mark). Moreover, I have data that gets frequently updated, which means updated_on_minute will keep changing.
A materialized view with updated_on_minute as the partition key and id as the clustering key. I am on version 3.9 of Cassandra and had just begun using these, but alas I found these release notes for 3.11.x (https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt), which declare them purely experimental and not meant for production clusters.
So then what are my options? Do I just need to maintain my own tables to track data that comes in timewise? Would love some input on this.
Thanks in advance.
As has always been the case, create an additional table to query by a different partition key.
In your case the table would be
create table MyTable_by_timestamp (
id text,
data text,
updated_on timestamp,
PRIMARY KEY (updated_on, id)
);
Write to both tables mytable_by_timestamp and mytable_by_id. Use the corresponding table to read from based on the partition key: either updated_on or id.
It's absolutely fine to duplicate data based on the use case (query) it's trying to solve.
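A minimal sketch of the dual write, using a logged batch so both tables stay consistent; the timestamp is written as a single literal (computed once, client-side) so the two rows agree:
BEGIN BATCH
INSERT INTO mytable_by_id (id, data, updated_on) VALUES ('row-1', 'payload', '2023-05-01 10:41:27+0000');
INSERT INTO mytable_by_timestamp (updated_on, id, data) VALUES ('2023-05-01 10:41:27+0000', 'row-1', 'payload');
APPLY BATCH;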
Edited:
In case there is a fear of huge partitions, you can always bucket into smaller partitions. For example, the table above could be broken down into:
create table MyTable_by_timestamp (
id text,
data text,
updated_on timestamp,
updated_min timestamp,
PRIMARY KEY (updated_min, id)
);
Here I have chosen every minute as the bucket size. Depending on how many updates you receive, you can change it to seconds (updated_sec) to reduce the partition size further.
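A sketch of a bucketed write and read, with made-up values; the minute bucket is simply updated_on truncated to the minute, computed in the application:
INSERT INTO MyTable_by_timestamp (updated_min, id, data, updated_on)
VALUES ('2023-05-01 10:41:00+0000', 'row-1', 'payload', '2023-05-01 10:41:27+0000');
-- read back everything that was updated within that minute:
SELECT id, data FROM MyTable_by_timestamp WHERE updated_min = '2023-05-01 10:41:00+0000';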
I'm trying to model a table in Cassandra; I'm quite new and have stumbled upon a problem. I've got the following:
CREATE TABLE content_registry (
service text,
file text,
type_id tinyint,
container text,
status_id tinyint,
source_location text,
expiry_date timestamp,
modify_date timestamp,
create_date timestamp,
to_overwrite boolean,
PRIMARY KEY ((service), file, type_id)
);
So as I understand:
service is my partition key; based on this value hashes will be generated and rows will be distributed across the cluster
file is a clustering key
type_id is a clustering key
These three together form a composite (compound) primary key
What I've figured out is that whenever I insert new data, Cassandra will upsert (either insert, or update if a row with that compound primary key exists)
What I'm struggling with is that I want my data to come back sorted by create_date in descending order; however, create_date is not part of the primary key.
If I add create_date to my primary key, I won't be able to upsert data, because create_date is the timestamp when the record was inserted, so if I add it to the primary key, every insert will end up creating a new record.
What are the other options? Order in the application? That doesn't seem very efficient.
What I've figured out is that whenever I insert new data, Cassandra will upsert (either insert, or update if a row with that compound primary key exists)
Totally right.
What I'm struggling with is that I want my data to come back sorted by create_date in descending order; however, create_date is not part of the primary key.
If I add create_date to my primary key, I won't be able to upsert data, because create_date is the timestamp when the record was inserted, so if I add it to the primary key, every insert will end up creating a new record.
With these sentences you are actually contradicting yourself.
If create_date isn't part of your key but just a property, and the data is upserted, the record is always the same one. Therefore, when querying by the key and fetching create_date, you always get the latest value. If you actually want the date when the record got created, you should simply not overwrite that column after the first time you inserted the record.
If you actually want to represent a series of data, you indeed need to avoid upserting; this could be done by using create_date as an additional partition key. I'd rather prefer using a timeuuid, which comes with quite handy functions.
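A sketch of one such series table; as a variant of the above, it clusters on a timeuuid within each service so rows stay ordered (the table name is an assumption). now() generates the timeuuid and toTimestamp() recovers the wall-clock time:
CREATE TABLE content_by_date (
service text,
created timeuuid,
file text,
type_id tinyint,
PRIMARY KEY ((service), created)
) WITH CLUSTERING ORDER BY (created DESC);
-- every insert creates a new row instead of upserting:
INSERT INTO content_by_date (service, created, file, type_id)
VALUES ('service1', now(), 'a.jpg', 1);
SELECT file, toTimestamp(created) FROM content_by_date WHERE service = 'service1';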
Last but not least, the most interesting question is what use case you actually want to reflect. When modelling data in Cassandra, you should always know in advance the queries you need to run.
The key concept in Cassandra is that you have to decide on your PRIMARY KEY, that is, what in your rows is unique and known at query time. This is a very basic requirement, since failing to recognize this will lead to a bad model.
From what I can see, you identified service as your PARTITION KEY, so I'm thinking that this field is what "rules" your data. This is something you must really know to perform even a single query (ignoring the inefficient table scan SELECT * FROM content_registry;). Within each service, you currently have your rows ordered by file and then by type_id. I don't know the exact meaning of the latter field, but you can currently have two rows identified by ('service1', 'a.jpg', 1) and ('service1', 'a.jpg', 2). So if type_id is somehow related to the file, the model is a bit incorrect.
Now, assuming you want to fetch the same records for each service in another order, what you really need to do is create another table that includes create_date as the first clustering column, e.g. PRIMARY KEY ((service), create_date, file, type_id). This will let you fetch records ordered by creation date, and when two records are created on the same date, they will be further ordered by file, and then by type_id.
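A sketch of that table, reusing the columns from the question (the table name is an assumption):
CREATE TABLE content_registry_by_date (
service text,
create_date timestamp,
file text,
type_id tinyint,
container text,
status_id tinyint,
PRIMARY KEY ((service), create_date, file, type_id)
) WITH CLUSTERING ORDER BY (create_date DESC, file ASC, type_id ASC);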
A second approach is to attach a secondary index to the create_date field of your original table. This will allow you to query by creation date.
A third approach, probably better than the second, is the use of a materialized view. It will hide a lot of the burden for you and will probably scale better than secondary indexes.
Please note that secondary indexes and materialized views usually don't scale well. Check whether these approaches are enough for your use case.
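Hedged sketches of both approaches, under the caveats above (the index and view names are assumptions):
CREATE INDEX content_registry_create_date_idx ON content_registry (create_date);
CREATE MATERIALIZED VIEW content_registry_by_create_date AS
SELECT * FROM content_registry
WHERE service IS NOT NULL AND create_date IS NOT NULL
AND file IS NOT NULL AND type_id IS NOT NULL
PRIMARY KEY ((service), create_date, file, type_id)
WITH CLUSTERING ORDER BY (create_date DESC, file ASC, type_id ASC);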
If I add create_date to my primary key, I won't be able to upsert data.
Why not? Suppose your key was PRIMARY KEY (service, create_date, file, type_id)? That will let you sort by create_date for each service, but not globally.
If you want to do it globally (that is, you want all services and all files sorted by create date), then things are probably more complex if you still want to be able to shard your data. One option would be to make the primary key PRIMARY KEY (create_date, service, file, type_id) and use one of the order-preserving partitioners.
Also, a bit more information here: http://www.datastax.com/dev/blog/we-shall-have-order
Here is a simple example of a user table in Cassandra. What is the best strategy to create a primary key?
My requirements are
search by uuid
search by username
search by email
All the keys mentioned will be high-cardinality keys. Also, at any moment I will have only one of them available to search by.
PRIMARY KEY(uid,username,email)
What if I have only the username? Then the above primary key is not useful. I am not able to visualize a solution to achieve this using a compound primary key.
What are the other options? Should we go with a new table mapping username to uid, and then search the user table?
All the articles out there on the internet recommend not creating a secondary index for high-cardinality keys.
CREATE TABLE medicscity.user (
uid uuid,
fname text,
lname text,
user_id text,
email_id text,
password text,
city text,
state_id int,
country_id int,
dob timestamp,
zipcode text,
PRIMARY KEY (??)
)
How do we solve this kind of situation ?
Yes, you need to go with duplicate tables.
If you ever face a situation in Cassandra in which you have to query a table based on column1, column2, or column3 independently, you will have to duplicate the tables.
Now, how much duplication you use is an individual choice.
For example, you can duplicate the table with the full data.
Or you can keep the full data in the main table, with column1 as the partition key and column2, column3 as the rest of the primary key, and then create a new table with the same primary key columns but partitioned on column2, and another one partitioned on column3 (see the sketch below).
This way your duplicates hold only the key columns, but you will end up querying twice: once against the lookup table, and once against the full-fledged table.
Big data technology is there to speed up computation and let your system scale horizontally, and it comes at the expense of disk/storage. Just look at everything: even its very basis, the replication factor, duplicates data.
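A sketch of the slimmer variant with abstract names, as described above (a non-key data column lives only in the main table):
CREATE TABLE main_table (
column1 text,
column2 text,
column3 text,
data text,
PRIMARY KEY (column1, column2, column3)
);
CREATE TABLE lookup_by_column2 (
column2 text,
column1 text,
column3 text,
PRIMARY KEY (column2, column1, column3)
);
CREATE TABLE lookup_by_column3 (
column3 text,
column1 text,
column2 text,
PRIMARY KEY (column3, column1, column2)
);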
Your PRIMARY KEY (uuid, username, email) doesn't fit your requirement, because you can't search on a clustering column without supplying the partition key, nor on the second clustering column without supplying the first.
E.g. you cannot search for username without uuid in the WHERE clause, and you cannot search for email without uuid and username.
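For illustration, assuming the table's own column names (uid, user_id, email_id) in that key, both of these queries would be rejected:
SELECT * FROM medicscity.user WHERE user_id = 'jdoe';
-- rejected: the partition key uid is missing
SELECT * FROM medicscity.user WHERE uid = 123e4567-e89b-12d3-a456-426614174000 AND email_id = 'jdoe@example.com';
-- rejected: the first clustering column user_id is skipped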
All you need is the denormalization and duplicate data.
Denormalization and duplication of data is a fact of life with Cassandra. Don’t be afraid of it. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPs, or network), and Cassandra is architected around that fact. In order to get the most efficient reads, you often need to duplicate data.
In your case, you need to create 3 tables that have the same columns (the data you want to get), but these 3 tables will have different PRIMARY KEYs: one with uuid as the PK, one with username as the PK, and one with email as the PK. :)
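A minimal sketch of the three tables, using the question's column names (the table names are assumptions; the remaining non-key columns are trimmed for brevity):
CREATE TABLE user_by_uid (
uid uuid PRIMARY KEY,
user_id text,
email_id text,
fname text,
lname text
);
CREATE TABLE user_by_username (
user_id text PRIMARY KEY,
uid uuid,
email_id text,
fname text,
lname text
);
CREATE TABLE user_by_email (
email_id text PRIMARY KEY,
uid uuid,
user_id text,
fname text,
lname text
);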
Suppose we have a table like this:
create table users (
id text,
roles set<text>,
PRIMARY KEY ((id))
);
I want all the values of this table to be stored on the same Cassandra node (OK, not literally the same one; the same three replicas, with all the data mirrored, but you get the point), so to achieve that I want to change this table to look like this:
create table users_v2 (
partition int,
id text,
roles set<text>,
PRIMARY KEY ((partition), id)
);
How can I do that without losing the data from the first table?
It seems to be impossible to ALTER TABLE in order to add such a column; I'm OK with that.
What I am trying to do is copy the data from the first table and insert it into the second table.
When I do it as-is, the partition column is missing, which is expected.
I can ALTER the first table and add a partition column at the end, and then COPY in the correct order, but I can't update all the rows in the first table to set them all to the same partition value, and there seems to be no "default" value when a column is added.
You simply cannot alter the primary key of a Cassandra table. You need to create another table with your new schema and perform a data migration. I would suggest using Spark for that, since it is really easy to migrate between two tables with only a few lines of code.
This also answers the "alter the primary key" question.
If you don't have a lot of data in the table, there is another way.
In the DataStax DevCenter utility, select the table and use the command "Export all results to file as INSERT". It will save all the data from the table to a file as INSERT CQL statements.
Then drop the table, create a new one with the new PARTITION KEY, and finally fill it with the statements from the file via CQL.
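A hedged sketch of that flow, assuming for now that every row goes into a single bucket 0 (the values are made up):
-- after exporting, drop the old table and create the new one:
DROP TABLE users;
CREATE TABLE users_v2 (
partition int,
id text,
roles set<text>,
PRIMARY KEY ((partition), id)
);
-- replay each exported INSERT, prepending the new partition value:
INSERT INTO users_v2 (partition, id, roles) VALUES (0, 'user-1', {'admin', 'editor'});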