Cassandra seconday index vs materialized view - cassandra

I'm modeling my table for Cassandra 3.0+. The objective is to build a table that store user's activities, here what i've done so far:
(userid come from another database Mysql)
CREATE TABLE activity (
userid int,
type int,
remoteid text,
time timestamp,
imported timestamp,
visibility int,
title text,
description text,
img text,
customfields MAP<text,text>,
PRIMARY KEY (userid, type, remoteid, time, imported))
This are the main queries that i use:
SELECT * FROM activity WHERE userid = ? AND remoteid = ?;
SELECT * FROM activity WHERE userid = ? AND type = ? AND LIMIT 10;
Now i need to add the column visibility on the second query. So, from what i've learned around, i can choose between a secondary index or a materialized view.
This are the facts:
Here i've one partition per user and inside there are thousands of rows (activities).
I use always the partition key (userid) in all my query to access the data.
the global number of activities are 30 millions, growing up.
visibility column has low cardinality (just 3 value) and could be updated, but rarely.
So what should i choose? materialized view or index? I know that index with low cardinality are bad choice, but my query include always the partition key and a limit, so maybe is not that bad.

If you are always going to use the partition key I recommend using secondary indexes.
Materialized views are better when you do not know the partition key
References:
Principal Article!
• Cassandra Secondary Index Preview #1
Here is a comparison with the Materialized Views and the secondary indices
• Materialized View Performance in Cassandra 3.x
And here is where the PK is known is more effective to use an index
• Cassandra Native Secondary Index Deep Dive

Related

Is it possible to use a hash type secondary index in scylladb?

Table for explanation:
CREATE TABLE test
(
id INT,
uniuque_string VARCHAR,
another_id INT,
PRIMARY KEY ((id, uniuque_string))
);
Sometimes it is necessary to make such requests:
SELECT * FROM test WHERE another_id = 12;
but another_id is not a primary key.
How can I create a hash index (for example, like in mysql) by this column?
Scylla and Cassandra have two features that fit your needs secondary index and materialized view:
You enable a secondary index on another_id with the command:
CREATE INDEX ON test (another_id);
And now your query SELECT * FROM test WHERE another_id = 12; will work normally. You should be aware that if this SELECT returns a huge number of results (a so-called low-cardinality index) this select is inefficient - it retrieves just the keys from the index, and to show everything (the SELECT *) Scylla needs to go back to the table and fetch its rows one by one, a fairly inefficient process.
The second alternative is a materialized view with another_id as a partition key, as in the command:
CREATE MATERIALIZED VIEW test_by_another_id AS
SELECT * FROM test
WHERE another_id IS NOT NULL AND uniuque_string IS NOT NULL and id IS NOT NULL
PRIMARY KEY (another_id, id, uniuque_string);
This materialized view will contain a copy of all the data (not just keys) from test, and can be searched by another key. For example: SELECT * FROM test_by_another_id WHERE another_id = 12; (note we had to replace test by the viewtest_by_another_id.
The materialized-view solution will make long selects much more efficient than the secondary-index solution, but it requires more storage (all the data is stored twice), so both solutions should be considered depending on your use case.

Delete whole row based on one of clusturing column value in cassandra

Schema I am using is as follows:
CREATE TABLE mytable(
id int,
name varchar,
PRIMARY KEY ((id),name)
) WITH CLUSTERING ORDER BY (name desc);
I wanted to delete records by following command :
DELETE FROM mytable WHERE name = 'Jhon';
But gived error
[Invalid query] message="Some partition key parts are missing: name"
As I looked for the reason, I came to know that only delete in not possible only with clustering columns.
Then I tried
DELETE FROM mytable WHERE id IN (SELECT id FROM mytable WHERE name='Jhon') AND name = 'Jhon';
But obviously it did not work.
I then tried with setting TTL to 0 for deleting row. But TTL can be set only for particular column, not the entire row.
What are feasible alternates to perform this operation?
In Cassandra, you need to design your data model to support your query. When you query your data, you always have to provide the partition key (otherwise the query would be inefficient).
The problem is that you want to query your data without a partition key. You would need to denormalize your data to support this kind or request. For example, you could add an additional table, such as:
CREATE TABLE id_by_name(
name varchar,
id int,
name varchar,
PRIMARY KEY (name, id)
) WITH CLUSTERING ORDER BY (id desc);
Then, you would be able to do your delete with a few queries:
SELECT ID from id_by_name WHERE name='John';
let's assume this returns 4.
DELETE FROM mytable WHERE id=4;
DELETE FROM id_by_name WHERE name='John' and id=4;
You could try to leverage materialized view (instead of maintaining yourself id_by_name) but materialized views are currently marked as unstable.
Now, there are still a few issues you need to address in your data model, in particular, how do you handle multiple user with the same name etc...
You cannot delete primary key if not complete. Primary key decisions are for sharding and load balancing. Cassandra can get complex if you are not used to thinking in columns.
I don't like the above answer, which though is good, complicates your solution. If you are thinking relational but getting lost in Cassandra I suggest using something that simplifies and maps your thinking to relational views.

If not MaterializedViews and not secondary indices then what else is the recommended way to query data in cassandra

I have some data in Cassandra. Say
create table MyTable {
id text PRIMARY KEY,
data text,
updated_on timestamp
}
My application in addition to querying this data by primary key id, needs to query it by updated_on timestamp as well. To fulfil the query by time use case I have tried the following.
create table MyTable {
id text PRIMARY KEY,
data text,
updated_on timestamp,
updated_on_minute timestamp
}
Secondary index on the updated_on_minute field. As I understand, secondary indexes are not recommended for high cardinality cases (which is my case, because I could have a lot of data at the same minute mark). Moreover I have data that gets frequently updated, which means the updated_on_minute will keep revving.
MaterializedView with updated_on_minute as the partition key and a id as the clustering key. I am on version 3.9 of cassandra and had just begun using these, but alas I find these release notes for 3.11x (https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt), which declare them purely experimental and not meant for production clusters.
So then what are my options? Do I just need to maintain my own tables to track data that comes in timewise? Would love some input on this.
Thanks in advance.
As always have been the case, create additional table to query by a different partition key.
In your case the table would be
create table MyTable_by_timestamp {
id text,
data text,
updated_on timestamp,
Primary key(updated_on, id)
}
Write to both tables mytable_by_timetamp and mytable_by_id. Use the corresponding table to READ from based on the partition key either updated_on or id.
It’s absolutely fine to duplicate data based on the use case (query) it’s trying solve.
Edited:
In case there is a fear about huge partition, you can always bucket into smaller partitions. For example the table above could be broken down into
create table MyTable_by_timestamp {
id text,
data text,
updated_on timestamp,
updated_min timestamp,
Primary key(updated_min, id)
}
Here I have chosen every minute as the bucket size. Depending on how many updates you receive, you can change it to seconds (updated_sec) to reduce the partition size further.

Creating secondary index on table in Cassandra

I have just started working on Cassandra.
I am bit confuse with the concept of secondary key.
From the definition I understood is indexing on the non key attribute of a table which is not sorted is secondary index.
So I have this table
CREATE TABLE IF NOT EXISTS userschema.user (id int,name text, address text, company text, PRIMARY KEY (id, name))
So If I create index like this
CREATE INDEX IF NOT EXISTS user_name_index ON userschema.user (name)
this should be secondary index.
But my requirement is to create index containing columns name , id , company.
How can I create a secondary index like this in Cassandra ?
I got this link which defines something of this short, but how come are these secondary indexes aren't they just table ?
These above user table is just the example not the actual one.
I am using Cassandra 3.0.9
id and name are already part of primary key.
So following queries will work
SELECT * FROM table WHERE id=1
SELECT * FROM table WHERE id=1 and name='some value'
SELECT * FROM table WHERE name='some value' ALLOW FILTERING (This is inefficeint)
You can create secondary index on company column
CREATE INDEX IF NOT EXISTS company_index ON userschema.user (company)
Now once secondary index is defined, it can be used in where clause along with primary key.
SELECT * FROM table WHERE id=1 and name='some value' and company='some value'
Though SELECT * FROM table WHERE company='some value' ALLOW FILTERING works it will be highly inefficient.
Before creating secondary index have look at When to use secondary index in cassandra
The link which you have referred mainly focuses on materialized views, in which we create virtual tables to execute the queries with non-primary keys.
Moreover, it seems you are creating secondary key on a Primary Key, which you have already defined in the creation of the table. Always remember that Secondary Index should be Non-Primary key.
To have a clear idea about the Secondary Indexes- Refer this https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSecondaryIndex.html
Now, Pros and cons of the alternative methods for the secondary index
1.Materialized views:
It will create new virtual tables and you should run the queries in a virtual table using the old Primary keys in old and original tables and new virtual Primary keys in the new materialized table. Any changes in data modification in the original old table will be reflected at materialized table. If you drop the materialized table, but the data will be created as tombstones whose gcc_graceseconds is 864000(10 days) default. Dropping the materialized table will not have any effect on original table.
2.ALLOW FILTERING:
It is highly inefficient and is not at all advised to use allow filtering as the latencies will be high and performance will be degraded.
If you want much more information, refer this link too How do secondary indexes work in Cassandra?
Correct me if I am wrong

Cassandra sorting the results by non-clustering key

Our use case with Cassandra is to show top 10 recent visitors of a blogpost. Following is the Cassandra table definition
CREATE TABLE blogs_by_visitor (
blogposturl text,
visitor text,
visited_ts timestamp,
PRIMARY KEY (blogposturl, visitor)
);
Now in order to show top 10 recent visitors for a given blogpost, there needs to be an explicit "order by" clause on timestamp desc. Since visted_ts isn't part of the clustering column in Cassandra, we aren't able to get this done. The reason for visited_ts not being part of clustering column is to avoid recording repeat (read as duplicate) visitors. The primary key is designed in such a way to upsert the latest timestamp for a repeat visitor.
In RDBMS world the query would look like the following and a secondary index could be created with blogposturl and timestamp columns.
Select visitor from blog_table
where
blogposturl = ?
and rownum <= 10
order by timestamp desc
An alternative currently being followed in our Cassandra application, is to obtain the results and then sort based on timestamp on the app side. But what if a particular blogpost becomes so popular and it had more than 100,000 visitors. The query becomes really slow for those blogs.
I'm thinking secondary index wouldn't be useful here, as I don't worry about filtering on it (rather just for sorting - which isn't possible).
Any idea on how we could model the table differently?
The actual table has additional columns, reduced it here for simplicity
These type of job are done by Apache Spark or Hadoop. A schedule job which compute the unique visitor order by timestamp for each url and store the result into cassandra.
Or you can create a Materialized View on top of the blogs_by_visitor. This table will make sure of unique visitor and the materialized view will oder the result based on visited_ts timestamp.
Let's create the Materialized View :
CREATE MATERIALIZED VIEW unique_visitor AS
SELECT *
FROM blogs_by_visitor
WHERE blogposturl IS NOT NULL AND visitor IS NOT NULL AND visited_ts IS NOT NULL
PRIMARY KEY (blogposturl, visited_ts, visitor)
WITH CLUSTERING ORDER BY (visited_ts DESC, visitor ASC);
Now you can just select the 10 recent unique visitor of a blogpost.
SELECT * FROM unique_visitor WHERE blogposturl = ? LIMIT 10;
you can see that i haven't specify the sort order in select query. Because in the materialized view schema a have specified default sort order visited_ts DESC
Note That : The above schema will result huge amount of unexpected tombstone generation in the Materialized Views
Or You could change your table schmea like below :
CREATE TABLE blogs_by_visitor (
blogposturl text,
year int,
month int,
day int,
visitor text,
visited_ts timestamp,
PRIMARY KEY ((blogposturl, year, month, day), visitor)
);
Now you have only a small amount of data in a single partition.So you can sort all the visitor based on visited_ts in that single partition from the client side. If you think number of visitor in a day can be huge then add hour to the partition key also.

Resources