Is it possible to use a hash-type secondary index in ScyllaDB?

Table for explanation:
CREATE TABLE test
(
id INT,
unique_string VARCHAR,
another_id INT,
PRIMARY KEY ((id, unique_string))
);
Sometimes it is necessary to make requests such as:
SELECT * FROM test WHERE another_id = 12;
but another_id is not part of the primary key.
How can I create a hash index (like in MySQL, for example) on this column?

Scylla and Cassandra have two features that fit your needs: secondary indexes and materialized views.
You enable a secondary index on another_id with the command:
CREATE INDEX ON test (another_id);
And now your query SELECT * FROM test WHERE another_id = 12; will work normally. You should be aware that if this SELECT returns a huge number of results (a so-called low-cardinality index), it is inefficient: the index holds just the keys, so to return everything (the SELECT *) Scylla needs to go back to the base table and fetch its rows one by one, a fairly inefficient process.
The second alternative is a materialized view with another_id as a partition key, as in the command:
CREATE MATERIALIZED VIEW test_by_another_id AS
SELECT * FROM test
WHERE another_id IS NOT NULL AND unique_string IS NOT NULL AND id IS NOT NULL
PRIMARY KEY (another_id, id, unique_string);
This materialized view will contain a copy of all the data (not just the keys) from test, and can be searched by another key. For example: SELECT * FROM test_by_another_id WHERE another_id = 12; (note we had to replace test by the view test_by_another_id).
The materialized-view solution will make long selects much more efficient than the secondary-index solution, but it requires more storage (all the data is stored twice), so both solutions should be considered depending on your use case.

Related

Data modelling to facilitate pruning/bulk update/delete in ScyllaDB/Cassandra

Let's say I have a table like the one below, with a composite partition key.
CREATE TABLE heartrate (
pet_chip_id uuid,
date text,
time timestamp,
heart_rate int,
PRIMARY KEY ((pet_chip_id, date), time)
);
Let's say there is a batch job to prune all the data older than X. I can't run the query below, since it is missing the other partition key column:
DELETE FROM heartrate WHERE date < '2020-01-01';
How do you model your data in such a way that this can be achieved in Scylla? I understand that internally Scylla creates partitions based on the partition keys, but in this case it's impractical to fetch the full list of pet_chip_id values and issue N delete queries.
I just wanted to know how people do this outside the RDBMS world.
The recommended way to delete old data automatically in Scylla is using the Time-to-live (TTL) feature:
When you write a row, you add "USING TTL 864000" if you want that data to be deleted automatically in 10 days. You can also specify a default TTL for a given table, so that every piece of data written to the table will expire after (say) 10 days.
Scylla's TTL feature is separate from the data itself, so it doesn't matter which columns you used as partition keys or clustering keys - in particular the "date" column no longer needs to be a clustering key (or exist at all, for that matter) - unless you also need it for something else.
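For example, a minimal sketch of both flavors, using the heartrate table from the question (the concrete values written here are hypothetical placeholders):
INSERT INTO heartrate (pet_chip_id, date, time, heart_rate)
VALUES (uuid(), '2020-01-01', toTimestamp(now()), 72)
USING TTL 864000; -- this row expires automatically after 10 days
-- or give the whole table a default TTL, so every write expires after 10 days:
ALTER TABLE heartrate WITH default_time_to_live = 864000;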
As @nadav-harel said in his answer, if you can define a TTL that's always the best solution. But if you can't, a possible alternative is to create a materialized view that lets you list the primary keys of the main table based on the field you need in the delete query. In the prune job you first select from the MV and then delete from the main table using the values you got from the MV.
Example:
CREATE TABLE my_table (
a uuid,
b text,
c text,
d int,
e timestamp,
PRIMARY KEY ((a, b), c)
);
CREATE MATERIALIZED VIEW my_mv AS
SELECT a,
b,
c
FROM my_table
WHERE b IS NOT NULL AND a IS NOT NULL AND c IS NOT NULL
PRIMARY KEY (b, a, c);
Then in your prune job you could select from my_mv based on b and then delete from my_table based on the values returned from the select query.
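For instance, a sketch of the prune job's two steps (the key value 'some-b' is a hypothetical placeholder):
SELECT a, c FROM my_mv WHERE b = 'some-b';
-- then, for each (a, c) pair returned:
DELETE FROM my_table WHERE a = ? AND b = 'some-b' AND c = ?;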
Note that this solution might not be effective depending on your model and the amount of data you have. Keep in mind that deleting data is also a way of querying your data, so your model should be defined based on your query needs; that is, before defining your model, think about every way you will query it (including how you will prune your data).

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am still more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without the primary key defined, as that is exactly my understanding problem):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid = 'xxxx-yyyy-zzzzz'; Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
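For instance, if you do switch uid to the uuid type, you can let CQL generate the key at insert time (a hypothetical insert; uuid() is a built-in CQL function):
INSERT INTO documents (uid, created, data) VALUES (uuid(), '2015-06-05', 'document body');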
The second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
• make sure a single partition won't be too large (max 100MB, otherwise you will run into trouble);
• satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine if you don't have too many documents within a day (so your partitions don't grow too much). It allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6 AND day<=24; If you need a range query that spans multiple months, you will need to issue multiple queries.
If your partitions are too large due to the data field, you will need to remove it from documents_by_date, and instead use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
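A sketch of that two-step lookup (the concrete values are hypothetical):
SELECT uid FROM documents_by_date WHERE year = 2018 AND month = 12 AND day = 6;
-- then, for each uid returned:
SELECT * FROM documents WHERE uid = 'xxxx-yyyy-zzzzz';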
If your partitions are still too large, you will need to add hour to the partition key of documents_by_date, as sketched below.
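One possible shape of that change (a hypothetical variant named documents_by_date_hourly; note that with hour in the partition key, range queries within a day must enumerate the hours):
CREATE TABLE documents_by_date_hourly (
year int,
month int,
day int,
hour int,
uid text,
PRIMARY KEY ((year, month, day, hour), uid));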
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the Stratio Lucene Cassandra plugin and index your date.
The question does not specify how your data is distributed with respect to user and creation time. But since it is a document, I am assuming that one user will create one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) helps you get the data ordered by created for a given user.
For your first requirement you can query as shown below:
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query as shown below:
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used diligently, as it scans all partitions. If we had to create a separate table with date as the primary key, it would become tricky when many documents are inserted at the very same second. Clustering order works best for requirements where documents for a given user need to be sorted by time.

Cassandra secondary index vs materialized view

I'm modeling my table for Cassandra 3.0+. The objective is to build a table that stores users' activities; here's what I've done so far:
(userid comes from another database, MySQL)
CREATE TABLE activity (
userid int,
type int,
remoteid text,
time timestamp,
imported timestamp,
visibility int,
title text,
description text,
img text,
customfields MAP<text,text>,
PRIMARY KEY (userid, type, remoteid, time, imported))
These are the main queries that I use:
SELECT * FROM activity WHERE userid = ? AND remoteid = ?;
SELECT * FROM activity WHERE userid = ? AND type = ? LIMIT 10;
Now I need to add the column visibility to the second query. From what I've learned, I can choose between a secondary index and a materialized view.
These are the facts:
There is one partition per user, and inside each there are thousands of rows (activities).
I always use the partition key (userid) in all my queries to access the data.
The total number of activities is 30 million, and growing.
The visibility column has low cardinality (just 3 values) and could be updated, but rarely.
So what should I choose: a materialized view or an index? I know that indexes on low-cardinality columns are a bad choice, but my queries always include the partition key and a limit, so maybe it's not that bad.
If you are always going to use the partition key, I recommend using secondary indexes.
Materialized views are better when you do not know the partition key.
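For example, a minimal sketch for the activity table above (an assumption, not the asker's code; depending on your Cassandra version, combining the index with additional clustering-column restrictions such as type may also require ALLOW FILTERING):
CREATE INDEX ON activity (visibility);
SELECT * FROM activity WHERE userid = ? AND visibility = ? LIMIT 10;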
References:
Principal article:
• Cassandra Secondary Index Preview #1
A comparison of materialized views and secondary indexes:
• Materialized View Performance in Cassandra 3.x
And here, showing that when the partition key is known, it is more effective to use an index:
• Cassandra Native Secondary Index Deep Dive

Get first row for each partition key in Cassandra

I am considering Cassandra as intermediate storage during my ETL job, to perform data deduplication.
Let's imagine I have a stream of events, each with a business entity id, a timestamp, and some value. I need to get only the latest value (in terms of the in-event timestamp) for each business key, but events may arrive unordered.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);
Now if I insert some data into this table, I can get the latest value for a given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require issuing such a query for every business key I'm interested in.
Is there an efficient way I could do this in CQL?
I know we have the ability to list all available partition keys (via SELECT DISTINCT id FROM table1). So given Cassandra's storage model, getting the first row for each partition key should not be too hard.
Is that supported?
If you're using Cassandra 3.6 or later, there is a query option named PER PARTITION LIMIT (CASSANDRA-7017) which you can set to 1. It won't auto-complete in cqlsh until 3.10, with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
In a word: no.
The partitioning key is why Cassandra can handle essentially any amount of data: it decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need an equality filter on the entire partitioning key. To find the first time for each id, Cassandra would have to ask all nodes for every partition of the data, then perform a complex operation on each of them. Relational databases allow this; Cassandra does not. All it allows are full table scans (SELECT * FROM table1) or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be combined with any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.

Order by with Cassandra NoSQL DB

I'm starting to use Cassandra, but I'm running into some problems with "ordering" and "selecting".
CREATE TABLE functions (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_function, sort, id_subfunction)
);
This is my table.
If I execute this query
SELECT * FROM functions WHERE id_subfunction = 0 ORDER BY sort;
this is what I get.
Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Where am I going wrong?
Thanks
PRIMARY KEY (id_function, sort, id_subfunction)
In Cassandra CQL the columns in a compound PRIMARY KEY are either partitioning keys or clustering keys. In your case, id_function (the first key listed) is the partitioning key. This is the key value that is hashed so that your data for that key can be evenly distributed on your cluster.
The remaining columns (sort and id_subfunction) are known as clustering columns, which determine the sort order of your data within a partition. This essentially means that your data will only be sorted by your clustering key(s) when a partitioning key is first designated in your WHERE clause.
You have two options:
1) Query this table by id_function instead:
SELECT * FROM functions WHERE id_function = 0 ORDER BY sort;
This will technically work, although I'm guessing that it won't give you the results that you are looking for.
2) The better option is to create a "query table": a table designed specifically to handle your query by id_subfunction. It differs from the original functions table only in that the PRIMARY KEY is defined with id_subfunction as the partitioning key:
CREATE TABLE functionsbysubfunction (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_subfunction, sort, id_function)
);
This query table will allow this query to function as expected:
SELECT * FROM functionsbysubfunction WHERE id_subfunction = 0;
And you shouldn't need to indicate ORDER BY, unless you want to explicitly specify ASCending or DESCending order, as shown below.
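For example, to request descending order explicitly:
SELECT * FROM functionsbysubfunction WHERE id_subfunction = 0 ORDER BY sort DESC;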
Remember with Cassandra, it is important to design your data model according to how you want to query your data. And that may not necessarily be the way that it originally makes sense to store it.
