Cassandra Data Modelling approach

I initially designed the static column family below in Cassandra:
create table APP_DATA (
    CODE varchar,
    DATA varchar,
    CREATED_DT timestamp,
    REQUEST_TYPE int,
    STATUS int,
    ..... #Some more columns ...,
    PRIMARY KEY ((CODE, DATA), CREATED_DT))
with clustering order by (CREATED_DT desc);
Now, I want to query the below
1)SELECT
SELECT * FROM APP_DATA WHERE CODE='1' AND DATA='1111111111';
SELECT * FROM APP_DATA WHERE CODE='1' AND DATA='1111111111' AND CREATED_DT<=dateof(now()) AND STATUS=0;
SELECT * FROM APP_DATA WHERE CODE='1' AND DATA='1111111111' AND CREATED_DT<=dateof(now()) AND STATUS=0 AND REQUEST_TYPE=9;
2)DELETE
DELETE FROM APP_DATA WHERE CREATED_DT+5<=sysdate;
How should I proceed with data modeling ?
How should I design to make the above select and delete queries faster ?
Please guide ..
Thanks in Advance.

Hi. First of all, take the CREATED_DT column out of the PRIMARY KEY, so that you are left with two columns in it. Make CREATED_DT a normal column and create secondary indexes to query on it.
Second, to delete the data that is older than five days (CREATED_DT + 5 <= sysdate), use the TTL (time to live) feature of Cassandra.
I hope this helps.
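As a sketch of the TTL suggestion, assuming a five-day retention (432000 seconds), the TTL can be applied per write or set as a table-wide default:

```cql
-- Per-write TTL: this row expires 5 days (432000 s) after insertion
INSERT INTO APP_DATA (CODE, DATA, CREATED_DT, REQUEST_TYPE, STATUS)
VALUES ('1', '1111111111', toTimestamp(now()), 9, 0)
USING TTL 432000;

-- Or a table-wide default, so every write expires automatically
ALTER TABLE APP_DATA WITH default_time_to_live = 432000;
```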

Here is the thing. I think your table looks good, and you do not need to take CREATED_DT out of the primary key, because you are ordering by it as DESC; in order to do that, it has to be a clustering column.
Secondly, Cassandra data modelling is a query-driven methodology, meaning you create a table to satisfy a query. Try to avoid creating secondary indexes as much as you can, and create tables instead to satisfy each query.
Your DML should be based on the partition key.
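As an illustration of the table-per-query approach (the table name is hypothetical), the SELECT filtering on STATUS and REQUEST_TYPE could be served by a companion table that puts those columns in the clustering key; the application then writes each row to both tables:

```cql
-- Hypothetical companion table for the STATUS/REQUEST_TYPE lookups
CREATE TABLE APP_DATA_BY_STATUS (
    CODE varchar,
    DATA varchar,
    STATUS int,
    REQUEST_TYPE int,
    CREATED_DT timestamp,
    PRIMARY KEY ((CODE, DATA), STATUS, REQUEST_TYPE, CREATED_DT)
) WITH CLUSTERING ORDER BY (STATUS ASC, REQUEST_TYPE ASC, CREATED_DT DESC);

-- This no longer needs ALLOW FILTERING: equality on the leading
-- clustering columns, then a range on the last one
SELECT * FROM APP_DATA_BY_STATUS
WHERE CODE = '1' AND DATA = '1111111111'
  AND STATUS = 0 AND REQUEST_TYPE = 9
  AND CREATED_DT <= toTimestamp(now());
```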

Related

Data modelling to facilitate pruning/bulk update/delete in ScyllaDB/Cassandra

Let's say I have a table like the one below, with a composite partition key.
CREATE TABLE heartrate (
    pet_chip_id uuid,
    date text,
    time timestamp,
    heart_rate int,
    PRIMARY KEY ((pet_chip_id, date), time)
);
Let's say there is a batch job to prune all the data older than X. I can't run the query below, since it's missing the other partition key column:
DELETE FROM heartrate WHERE date < '2020-01-01';
How do you model your data in such a way that this can be achieved in Scylla? I understand that internally Scylla creates a partition based on the partition keys, but in this case it's impossible to query the full list of pet_chip_id values and run N queries to delete.
Just wanted to know how people do this outside RDBMS world.
The recommended way to delete old data automatically in Scylla is using the Time-to-live (TTL) feature:
When you write a row, you add "USING TTL 864000" if you want that data to be deleted automatically in 10 days. You can also specify a default TTL for a given table, so that every piece of data written to the table will expire after (say) 10 days.
Scylla's TTL feature is separate from the data itself, so it doesn't matter which columns you used as partition keys or clustering keys - in particular the "date" column no longer needs to be a clustering key (or exist at all, for that matter) - unless you also need it for something else.
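A minimal sketch of both options for the heartrate table, assuming a 10-day retention:

```cql
-- Per-write TTL (864000 s = 10 days)
INSERT INTO heartrate (pet_chip_id, date, time, heart_rate)
VALUES (uuid(), '2020-01-01', toTimestamp(now()), 72)
USING TTL 864000;

-- Or as a table default applied to every subsequent write
ALTER TABLE heartrate WITH default_time_to_live = 864000;
```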
As @nadav-harel said in his answer, if you can define a TTL that's always the best solution. But if you can't, a possible solution is to create a materialized view that lets you list the primary keys of the main table based on the field you need in the delete query. In the prune job you can first select from the MV, and then delete from the main table using the values you got from the MV.
Example:
CREATE TABLE my_table (
    a uuid,
    b text,
    c text,
    d int,
    e timestamp,
    PRIMARY KEY ((a, b), c)
);
CREATE MATERIALIZED VIEW my_mv AS
    SELECT a, b, c
    FROM my_table
    WHERE a IS NOT NULL AND b IS NOT NULL AND c IS NOT NULL
    PRIMARY KEY (b, a, c);
Then in your prune job you could select from my_mv based on b and then delete from my_table based on the values returned from the select query.
Note that this solution might not be effective depending on your model and the amount of data you have. Keep in mind that deleting data is also a way of querying your data, and your model should be defined based on your query needs; i.e., before defining your model, you need to think about every way you will query it (including how you will prune your data).
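As a sketch, the prune job described above could run queries along these lines (the bucket and key values are hypothetical):

```cql
-- 1) Read the full primary keys from the view for the partition to prune
SELECT a, b, c FROM my_mv WHERE b = 'some-expired-bucket';

-- 2) Delete each returned key from the base table
DELETE FROM my_table
WHERE a = 11111111-2222-3333-4444-555555555555
  AND b = 'some-expired-bucket'
  AND c = 'some-c-value';
```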

Why does querying based on the first clustering key require an ALLOW FILTERING?

Say I have this Cassandra table:
CREATE TABLE orders (
    customerId int,
    datetime date,
    amount int,
    PRIMARY KEY (customerId, datetime)
);
Then why would the following query require ALLOW FILTERING:
SELECT * FROM orders WHERE datetime >= '2020-01-01'
Cassandra could just go to all the individual partitions (i.e. customers) and filter on the clustering key datetime. Since datetime is sorted, there is no need to retrieve all the rows in orders and filter out the ones that don't match my WHERE clause (as far as I understand it).
I hope someone can enlighten me.
Thanks
This happens because, for normal operation, Cassandra needs the partition key: it's used to find which machine(s) store the data. If you don't have the partition key, as in your example, Cassandra needs to scan all the data to find the rows matching your query, and this requires ALLOW FILTERING.
P.S. Data is sorted only inside the individual partitions, not globally.
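To avoid ALLOW FILTERING, restrict the query to a single partition; the clustering-key range is then efficient, because rows are sorted within that partition:

```cql
-- Efficient: one partition, sorted range scan over the clustering key
SELECT * FROM orders
WHERE customerId = 42 AND datetime >= '2020-01-01';
```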

Delete whole row based on one of the clustering column values in Cassandra

The schema I am using is as follows:
CREATE TABLE mytable(
    id int,
    name varchar,
    PRIMARY KEY ((id), name)
) WITH CLUSTERING ORDER BY (name desc);
I wanted to delete records with the following command:
DELETE FROM mytable WHERE name = 'Jhon';
But it gave the error:
[Invalid query] message="Some partition key parts are missing: name"
As I looked for the reason, I came to know that a delete is not possible with only clustering columns.
Then I tried
DELETE FROM mytable WHERE id IN (SELECT id FROM mytable WHERE name='Jhon') AND name = 'Jhon';
But obviously it did not work.
I then tried setting a TTL of 0 to delete the row. But a TTL can be set only for a particular column, not the entire row.
What are feasible alternates to perform this operation?
In Cassandra, you need to design your data model to support your query. When you query your data, you always have to provide the partition key (otherwise the query would be inefficient).
The problem is that you want to query your data without a partition key. You would need to denormalize your data to support this kind of request. For example, you could add an additional table, such as:
CREATE TABLE id_by_name(
    name varchar,
    id int,
    PRIMARY KEY (name, id)
) WITH CLUSTERING ORDER BY (id desc);
Then, you would be able to do your delete with a few queries:
SELECT ID from id_by_name WHERE name='John';
let's assume this returns 4.
DELETE FROM mytable WHERE id=4;
DELETE FROM id_by_name WHERE name='John' and id=4;
You could try to leverage materialized view (instead of maintaining yourself id_by_name) but materialized views are currently marked as unstable.
Now, there are still a few issues you need to address in your data model; in particular, how do you handle multiple users with the same name, etc.
You cannot delete by primary key if it is not complete. Primary key decisions are for sharding and load balancing, and Cassandra can get complex if you are not used to thinking in columns.
I don't like the above answer, which, though good, complicates your solution. If you are thinking relationally but getting lost in Cassandra, I suggest using something that simplifies and maps your thinking to relational views.

Cassandra: how to add a clustering key to a table?

There is a table in cassandra
create table test_moments(id Text, title Text, sort int, PRIMARY KEY(id));
How add clustering key in column "sort". Not re-creating the table
The main problem is the on-disk data structure. Clustering key directly dictates how data is sorted and serialized to disk (and then searched), so what you're asking is not possible.
The only way is to "migrate" the data to another table. Depending on your data, if you have a lot of records you could encounter some timeout error during the queries, so be prepared to tweak your migration with some useful techniques such as the COPY command or the TOKEN function.
Have a look at this SO question also.
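The migration could be sketched as follows. Note that COPY is a cqlsh command rather than CQL proper, and is only practical for modest data sets; for large tables, a Spark job or the TOKEN-based paging mentioned above would be more robust:

```cql
-- New table with 'sort' as a clustering key
CREATE TABLE test_moments_v2 (
    id text,
    title text,
    sort int,
    PRIMARY KEY (id, sort)
);

-- Export from the old table and import into the new one (in cqlsh)
COPY test_moments (id, title, sort) TO 'moments.csv';
COPY test_moments_v2 (id, title, sort) FROM 'moments.csv';
```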
All you need to do is add it as the second part of the PRIMARY KEY to make it a composite key (note that this still means creating a new table and migrating the data):
create table test_moments(id Text, title Text, sort int, PRIMARY KEY(id, sort));

Cassandra select with clustering key range that might not have primary key

Sorry, the title might or might not give an exact description of what I intended.
Here is the problem: I need to select data based on date ranges, and most of our queries use an 'id' field.
So, I have created a data model with the id as partition key and the date as clustering key.
Essentially like below (I am just using fake/sample statements, as I cannot give actual details).
create table tab1(
    id text,
    col1 text,
    ... coln text,
    rec_date date,
    rec_time timestamp,
    PRIMARY KEY ((id), rec_date, rec_time)
) WITH CLUSTERING ORDER BY (rec_date DESC, rec_time DESC);
It works for most of the queries and has worked fine.
However, I was trying to optimize the scenario below:
-> All the records that are greater than the date abcd-xy-kl
Which one of the below approaches would be good for me? Or anything better than these two?
1) The very basic/simple approach. Use the query:
select * from tab1 where id > '0' AND rec_date > 'abcd-xy-kl'
Every record will essentially be greater than '0', but it might still do a full table scan.
2) Create a secondary index on rec_date and simply use the query:
select * from tab1 where rec_date > 'abcd-xy-kl'
Also, one key thing is that I am using Spark, with cassandraSqlContext.sql to get the dataframe.
So, considering all the above details, which approach would be better?
I don't see the point of filtering with id as in your first example. The following should work and would be a better approach from my perspective:
select * from tab1 where rec_date > 'abcd-xy-kl' ALLOW FILTERING;
Note that it won't work without ALLOW FILTERING at the end.
You cannot use > 0 on the partition key; it is not supported by Cassandra. Check the documentation for more information on the limitations of the WHERE part of queries.
In order to query by your clustering keys efficiently you really need to use a secondary index. Refrain from using the ALLOW FILTERING unless you know what you're doing, because it could trigger a "distributed" scan and perform very poorly. Check the documentation for more information.
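A sketch of the secondary-index suggestion above (whether it performs well depends on the cardinality of rec_date and the cluster size). One caveat worth stating: native secondary indexes only serve equality predicates, so the range query from the question still needs ALLOW FILTERING even with the index in place:

```cql
-- Index on the first clustering column
CREATE INDEX rec_date_idx ON tab1 (rec_date);

-- Equality lookups are served by the index:
SELECT * FROM tab1 WHERE rec_date = '2020-01-01';

-- A range predicate still requires ALLOW FILTERING:
SELECT * FROM tab1 WHERE rec_date > '2020-01-01' ALLOW FILTERING;
```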
