Cassandra: Push the filtered rows to a new column family using CQL and delete them from existing column family

I'm new to Cassandra and am confused about how to archive data. This is the approach I am trying to implement:
1. Filter the records to be archived.
2. Create a new column family.
3. Move the filtered records to the new column family.
4. Delete the filtered records from the existing column family.
Here is where I stand on each step:
1. Filter the records to be archived: achieved with the use of secondary indexes.
2. Create a new column family: Create query.
3. Move the filtered records to the new column family: I thought of implementing this with the approach mentioned in "cassandra copy data from one columnfamily to another columnfamily", but that copies all the data from column family 1 to 2. Is it possible to move only the filtered rows to the new column family?
4. Delete the filtered records from the existing column family: I am not sure how to achieve this in CQL. Please help me.
Additionally, let me know if there is a better approach.

COPY then DELETE sounds like a valid strategy here.
For deleting rows, take a look at the DELETE command. Its WHERE clause is written like the one in a SELECT, but it has to identify the rows by their primary key.
Unfortunately, this means it won't work for a condition that would require "ALLOW FILTERING" in a SELECT, although there is an enhancement request to add this.
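A minimal sketch of the copy-then-delete idea, under a few assumptions that are not in the original posts: `session` behaves like a DataStax Python driver `Session` (an `execute(query, parameters)` method with `%s` placeholders), the filtered rows are already available as dicts, and the names `archive_rows`, `events` and `events_archive` are purely illustrative. The two statements are not atomic, so a failure between them can leave a row in both tables.

```python
from typing import Any, Iterable, Mapping, Sequence


def archive_rows(session: Any,
                 rows: Iterable[Mapping[str, Any]],
                 source: str,
                 target: str,
                 key_columns: Sequence[str]) -> int:
    """Copy each row into `target`, then delete it from `source` by primary key.

    Table and column names are interpolated into the CQL text because
    identifiers cannot be bound as parameters, so they must come from
    trusted code, never from user input.
    """
    moved = 0
    for row in rows:
        columns = list(row.keys())
        insert_cql = "INSERT INTO {} ({}) VALUES ({})".format(
            target, ", ".join(columns), ", ".join(["%s"] * len(columns)))
        session.execute(insert_cql, tuple(row[c] for c in columns))

        # DELETE has to name the primary key of the row explicitly.
        where = " AND ".join("{} = %s".format(k) for k in key_columns)
        delete_cql = "DELETE FROM {} WHERE {}".format(source, where)
        session.execute(delete_cql, tuple(row[k] for k in key_columns))
        moved += 1
    return moved
```

With the real driver, the rows could come from the indexed SELECT itself, for example after setting `session.row_factory = dict_factory` (from `cassandra.query`) so that each result row is a dict.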

How to understand the 'Flexible schema' in Cassandra?

I am new to Cassandra, and found the following on Wikipedia.
A column family (called "table" since CQL 3) resembles a table in an RDBMS (Relational Database Management System). Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.[29]
It says that 'different rows in the same column family do not have to share the same set of columns', but how do I implement that? I have read almost all the documents on the official site.
I can create a table and insert data like below.
CREATE TABLE Emp_record(E_id int PRIMARY KEY,E_score int,E_name text,E_city text);
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (101, 85, 'ashish', 'Noida');
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (102, 90, 'ankur', 'meerut');
This is very similar to what I would do in a relational database. So how do I create multiple rows with different columns?
I also found that the official documentation mentions a 'Flexible schema'. How should I understand it here?
Thanks very much in advance.
Column family is a term from the original design of Cassandra, when the data model looked like Google Bigtable or Apache HBase and the Thrift protocol was used for communication. But this required the schema to be defined inside the application, which makes access to the data from many applications more problematic, as you need to update the schema inside all of them...
CREATE TABLE and INSERT are part of the Cassandra Query Language (CQL), which was introduced a long time ago and replaced the Thrift-based implementation (Cassandra 4.0 completely removed Thrift support). In CQL you need a schema defined for each table, where you provide the name and type of every column. If you really need dynamic columns, there are several approaches (I'll link answers that I have already written over time, so there won't be duplicates):
If you have values of the same type, you can use one column as the name of the attribute/column and another to store the value, as described here.
If you have values of different types, you can also use one column as the name of the attribute/column and define multiple columns for values, one for each of the data types (int, text, ...), and insert each value into the corresponding column only (described here).
You can use maps (described here). This is similar to the first or second approach, but is mostly designed for a very small number of "dynamic columns", and it has other limitations; for example, you need to read the full map to fetch one value.
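The second approach (one value column per data type) could be sketched as follows. The table `entity_attrs`, its columns and the helper `attribute_inserts` are hypothetical names chosen for illustration, and the statements use the `%s` placeholder style of the DataStax Python driver; the helper only builds statements, it does not talk to a cluster.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical table for the "one value column per type" approach:
#   CREATE TABLE entity_attrs (
#       entity_id int, attr_name text,
#       int_value int, text_value text,
#       PRIMARY KEY (entity_id, attr_name));
INSERT_CQL = ("INSERT INTO entity_attrs (entity_id, attr_name, {col}) "
              "VALUES (%s, %s, %s)")


def attribute_inserts(entity_id: int,
                      attrs: Dict[str, Any]) -> List[Tuple[str, tuple]]:
    """Turn a dict of "dynamic columns" into one INSERT per attribute.

    Each value goes into the column matching its type; the other value
    column is simply left unset for that row.
    """
    statements = []
    for name, value in attrs.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            raise TypeError("unsupported type for {!r}".format(name))
        if isinstance(value, int):
            col = "int_value"
        elif isinstance(value, str):
            col = "text_value"
        else:
            raise TypeError("unsupported type for {!r}".format(name))
        statements.append((INSERT_CQL.format(col=col), (entity_id, name, value)))
    return statements
```

Each returned pair can then be passed to `session.execute(cql, params)`. Two employees with different attribute sets simply produce different sets of rows under their own `entity_id`.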

Cassandra Altering the table

I have a table in Cassandra say employee(id, email, role, name, password) with only id as my primary key.
I want to ...
1. Add another column (manager_id) with a default value in it.
I know that I can add a column to the table, but there is no way I can provide a default value for that column through CQL. I also cannot update the value of manager_id later, since I need to know the id to update a row (it is the partition key, and its values are randomly generated unique values which I don't know). Is there any way I can achieve this?
2. Rename this table to all_employee.
I also know that it's not allowed to rename a table in Cassandra. So I am trying to copy the data of the table (employee) to CSV, copy from the CSV to a new table (all_employee), and delete the old table (employee). I am doing this through an automated script with CQL queries in it. The script works fine, but it will fail if it gets executed again (which I cannot prevent), since the employee table will no longer be there once it's deleted. Essentially I am looking for an "IF EXISTS" clause in the COPY command, which is not supported in CQL. Is there any other way I can achieve the outcome?
Please note that the amount of data in the table is very small, so performance is not an issue.
For #1
I don't think Cassandra supports default values for columns. You need to do that from your application: write some default value every time you insert a row.
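For the rows that already exist, one possible way to do this from the application, given that the table is very small, is to read every id and write the default where manager_id is still unset. This is only a sketch under assumptions: `session` is taken to behave like a DataStax Python driver `Session` that returns rows as dicts, and `backfill_default` is a hypothetical helper name. A full-table read like this would not be appropriate for a large table.

```python
from typing import Any


def backfill_default(session: Any, default_manager_id: Any) -> int:
    """Set manager_id to a default on every employee row where it is unset."""
    # Full scan: acceptable here only because the table is very small.
    rows = session.execute("SELECT id, manager_id FROM employee")
    updated = 0
    for row in rows:
        if row["manager_id"] is None:
            session.execute(
                "UPDATE employee SET manager_id = %s WHERE id = %s",
                (default_manager_id, row["id"]))
            updated += 1
    return updated
```

Because rows that already have a manager_id are skipped, running the helper a second time should leave the data unchanged.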
For #2
You can check if the table exists before trying to copy from it.
SELECT table_name FROM system_schema.tables WHERE keyspace_name='your_keyspace_name' AND table_name='your_table_name';
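If the script can call into Python, the existence check might be wrapped as in the sketch below. It assumes a Cassandra version that has the `system_schema` keyspace (3.0 or later) and a `session` object that behaves like a DataStax Python driver `Session`; `table_exists` is a hypothetical helper name.

```python
from typing import Any


def table_exists(session: Any, keyspace: str, table: str) -> bool:
    """Return True if system_schema.tables lists the given table.

    Unquoted identifiers are stored in lower case, so pass 'employee'
    rather than 'Employee' unless the table was created with quotes.
    """
    rows = session.execute(
        "SELECT table_name FROM system_schema.tables "
        "WHERE keyspace_name = %s AND table_name = %s",
        (keyspace, table))
    # Any returned row means the table is present.
    return any(True for _ in rows)
```

The script would then run the COPY and DROP steps only when `table_exists(session, "your_keyspace_name", "employee")` returns True, which makes a second run a no-op.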

Cassandra alter column type: best way with non-compatible types?

I have a large table in Cassandra with a column of type int but no values are outside the range 0-10. I want to reduce the table size by changing the type of the column to tinyint.
This is the error I get
[Query invalid because of configuration issue] message="Cannot change COLUMN_NAME from type int to type tinyint: types are not order-compatible.">
Is there a nice way to handle this with a cast or other such query trickery?
If not ... and without taking the database down, is there a better way to solve this than doing the following?
make a new column of type tinyint
update my code to duplicate data to this column during write operations
copy old data to the new column [will take a while probably]
swap the names of the columns
revert my code change (only update one column)
delete the old int column
I would say deleting old columns and copying data to new columns is not ideal.
If your cassandra column family is accessed by a single entry point (service), my suggestion would be,
Add a new column.
Retain the old column. (You can rename it like COLUMNNAME_OBSOLETE).
After updating your code, populate only the new column when writing.
When reading data into the domain object, if the new column is null, fill it from the old column.
In one of our projects, we followed the above steps against prod data and it worked fine. After a few months, when we no longer needed COLUMNNAME_OBSOLETE, we dropped that column.
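The read-side fallback from the steps above might look like this in application code. The column names `score_tiny` and `score` and the function `read_score` are hypothetical, and the row is assumed to arrive as a dict.

```python
from typing import Any, Mapping, Optional


def read_score(row: Mapping[str, Any],
               new_col: str = "score_tiny",
               old_col: str = "score") -> Optional[int]:
    """Prefer the new column; fall back to the old one for unmigrated rows."""
    value = row.get(new_col)
    # Compare against None explicitly: 0 is a legitimate stored value
    # and must not trigger the fallback.
    if value is None:
        value = row.get(old_col)
    return value
```

Once every row has been rewritten through the new code path, the fallback branch is no longer exercised and the old column can be dropped.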

Does it have any issue if I want to add a new column for big table in cassandra?

In my project, I am using Cassandra to store a huge amount of data. With a big MySQL table, it takes a long time to add a new column or index. Will Cassandra solve that issue?
Yes, it is relatively easy to add a column and index that column in Cassandra.
Any added column is propagated to all nodes very quickly, too. Existing rows will show NULL for the added column by default.

how to define dynamic columns in a column family in Cassandra?

We don't want to fix the column definitions when creating a column family, as we might have to insert new columns into the column family later. Is it possible to achieve this? I am wondering whether it is possible not to define the column metadata when creating a column family, but to specify the column when the client updates data, for example:
CREATE COLUMN FAMILY products WITH default_validation_class= UTF8Type AND key_validation_class=UTF8Type AND comparator=UTF8Type;
set products['1001']['brand']= 'Sony';
Thanks,
Fan
Yes... it is possible to achieve this without any special effort. Per the DataStax documentation of the Cassandra data model (a good read, by the way, along with the CQL spec):
The Cassandra data model is a schema-optional, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as they are needed without incurring downtime to your application.
