Cassandra composite index and COMPACT STORAGE

I am new to Cassandra and have not run it yet, but my business logic requires me to create a table like this:
CREATE TABLE Index(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, keyword, score); )
WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
Is it possible or not? I have only one column (fID) which is not part of my composite index, so I hope I will be able to apply the COMPACT STORAGE setting. Pay attention that I ordered by the third column of my composite index, not the second. I need to compact the storage as well, so the keywords will not be repeated for each fID.

A few things initially about your CREATE TABLE statement:
It will error on the semicolon (;) after your PRIMARY KEY definition.
You will need to pick a new name, as Index is a reserved word.
Pay attention that I ordered by the third column of my composite index, not the second.
You cannot skip a clustering key when you specify CLUSTERING ORDER.
However, I do see an option here. Depending on your query requirements, you could simply re-order keyword and score in your PRIMARY KEY definition, and then it would work:
CREATE TABLE giveMeABetterName(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, score, keyword)
) WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
That way, you could query by user_id and your rows (keywords?) for that user would be ordered by score:
SELECT * FROM giveMeABetterName WHERE user_id=1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4;
If that won't work for your business logic, then you might have to rethink your data model. But it is not possible to skip a clustering key when specifying CLUSTERING ORDER.
Edit
But re-ordering of columns does not work for me. Can I do something like this: WITH CLUSTERING ORDER BY (keyword ASC, score DESC)?
Let's look at some options here. I created a table with your original PRIMARY KEY, but with this CLUSTERING ORDER. That will technically work, but look at how it treats my sample data (video game keywords):
aploetz#cqlsh:stackoverflow> SELECT * FROM givemeabettername WHERE user_id=dbeddd12-40c9-4f84-8c41-162dfb93a69f;
user_id | keyword | score | fid
--------------------------------------+------------------+-------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Assassin's creed | 87 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Battlefield 4 | 9 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Uncharted 2 | 91 | 0
(3 rows)
On the other hand, if I alter the PRIMARY KEY to cluster on score first (and adjust CLUSTERING ORDER accordingly), the same query returns this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
Note that you'll want to change the data type of score from TEXT to a numeric type (int/bigint); otherwise you'll get ASCII-betical sorting, like this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
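To avoid that, the table could be rebuilt with a numeric score. A minimal sketch (reusing the placeholder table name from above; the data would need to be reloaded):

```sql
-- Sketch: same model as above, but score is an int so it sorts numerically.
CREATE TABLE giveMeABetterName (
    user_id uuid,
    score int,
    keyword text,
    fID int,
    PRIMARY KEY (user_id, score, keyword)
) WITH CLUSTERING ORDER BY (score DESC) AND COMPACT STORAGE;
```

With score as an int, the DESC clustering order yields 91, 87, 9 as in the earlier result.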
Something that might help you is to read through this DataStax doc on Compound Keys and Clustering.

Related

How can I filter for a specific date on a CQL timestamp column?

I have a table defined as:
CREATE TABLE downtime(
asset_code text,
down_start timestamp,
down_end timestamp,
down_duration duration,
down_type text,
down_reason text,
PRIMARY KEY ((asset_code, down_start), down_end)
);
I'd like to get downtime on a particular day, such as:
SELECT * FROM downtime \
WHERE asset_code = 'CA-PU-03-LB' \
AND todate(down_start) = '2022-12-11';
I got a syntax error:
SyntaxException: line 1:66 no viable alternative at input '(' (...where asset_code = 'CA-PU-03-LB' and [todate](...)
If a function is not allowed on a partition key in the WHERE clause, how can I get data with a down_start on a particular day?
You don't need to use the TODATE() function to filter for a specific date. You can simply specify the date as '2022-12-11' when applying a filter on a CQL timestamp column.
The difference is that you cannot use the equality operator (=), because the CQL timestamp data type is encoded as the number of milliseconds since the Unix epoch (Jan 1, 1970 00:00 GMT), so you need to be precise when you're working with timestamps.
Let me illustrate using this example table:
CREATE TABLE tstamps (
id int,
tstamp timestamp,
colour text,
PRIMARY KEY (id, tstamp)
)
My table contains the following sample data:
cqlsh> SELECT * FROM tstamps ;
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-05 11:25:01.000000+0000 | red
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
1 | 2022-12-07 01:48:07.870000+0000 | blue
1 | 2022-12-07 03:13:27.313000+0000 | indigo
The cqlsh client formats the tstamp column into a human-readable date in UTC. But really, the tstamp values are stored as integers:
cqlsh> SELECT tstamp, TOUNIXTIMESTAMP(tstamp) FROM tstamps ;
tstamp | system.tounixtimestamp(tstamp)
---------------------------------+--------------------------------
2022-12-05 11:25:01.000000+0000 | 1670239501000
2022-12-06 02:45:04.564000+0000 | 1670294704564
2022-12-06 11:06:48.119000+0000 | 1670324808119
2022-12-06 19:02:52.192000+0000 | 1670353372192
2022-12-07 01:48:07.870000+0000 | 1670377687870
2022-12-07 03:13:27.313000+0000 | 1670382807313
To retrieve the rows for a specific date, you need to specify the range of timestamps which fall on a specific date. For example, the timestamps for 6 Dec 2022 UTC ranges from 1670284800000 (2022-12-06 00:00:00.000 UTC) to 1670371199999 (2022-12-06 23:59:59.999 UTC).
This means if we want to query for December 6, we need to filter using a range query:
SELECT * FROM tstamps \
WHERE id = 1 \
AND tstamp >= '2022-12-06' \
AND tstamp < '2022-12-07';
and we get:
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
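By contrast, an equality filter is legal here but almost never what you want: '2022-12-06' parses to exactly 2022-12-06 00:00:00.000 UTC, so (with the sample data above) it would match nothing:

```sql
-- Equality compares against midnight UTC exactly, not the whole day.
SELECT * FROM tstamps WHERE id = 1 AND tstamp = '2022-12-06';
```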
WARNING - In your case, where the timestamp column is part of the partition key, performing a range query is dangerous because it results in a multi-partition query -- there are 86.4 million possible millisecond values between 1670284800000 and 1670371199999. For this reason, timestamps are not a good choice for partition keys. Cheers!
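A common alternative (a sketch, not part of the original answer; down_date is a hypothetical bucketing column) is to bucket the partition key by day, so that one day's downtime lives in a single partition:

```sql
-- Sketch: bucket by day; down_date holds just the date portion of down_start.
CREATE TABLE downtime_by_day (
    asset_code text,
    down_date date,
    down_start timestamp,
    down_end timestamp,
    down_type text,
    down_reason text,
    PRIMARY KEY ((asset_code, down_date), down_start)
);

-- "Downtime on a particular day" then becomes a single-partition query:
SELECT * FROM downtime_by_day
WHERE asset_code = 'CA-PU-03-LB' AND down_date = '2022-12-11';
```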

Cassandra MAX function returning mismatched rows

Hi, I am trying to get the max co-author publication count from a table in Cassandra; however, it returns mismatched rows when I query:
select coauthor_name, MAX(num_of_colab) AS max_2020 from coauthor_by_author where pid = '40/2499' and year=2020;
It returns a row pairing a co-author's name with 9 (screenshot omitted), which is wrong because 9 belongs to another co-author.
Here is my create statement for the table:
CREATE TABLE IF NOT EXISTS coauthor_by_author (
pid text,
year int,
coauthor_name text,
num_of_colab int,
PRIMARY KEY ((pid), year, coauthor_name, num_of_colab)
) WITH CLUSTERING ORDER BY (year desc);
As proof, here is part of the original table (screenshot omitted).
As you can see, the number of publications for co-author Abdul Hanif Bin Zaini should only be 1.
The MAX() function is working as advertised but I think your understanding of how it works is incorrect. Let me illustrate with an example.
Here is the schema for my table of authors:
CREATE TABLE authors_by_coauthor (
author text,
coauthor text,
colabs int,
PRIMARY KEY (author, coauthor)
)
Here is a sample data of authors, their corresponding co-authors, and the number of times they collaborated:
author | coauthor | colabs
---------+-----------+--------
edda | ramakanta | 5
edda | ruzica | 9
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
cassius | ceadda | 14
cassius | flaithri | 13
Anita has three co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'anita';
author | coauthor | colabs
--------+----------+--------
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
And the top number of collaborations for Anita is 12:
SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'anita';
system.max(colabs)
--------------------
12
Similarly, Cassius has two co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'cassius';
author | coauthor | colabs
---------+----------+--------
cassius | ceadda | 14
cassius | flaithri | 13
with 14 as the most collaborations:
cqlsh> SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'cassius';
system.max(colabs)
--------------------
14
Your question is incomplete since you haven't provided the full sample data, but I suspect you're expecting to get the name of the co-author with the most collaborations. This CQL query will NOT return the result you're after:
SELECT coauthor_name, MAX(num_of_colab)
FROM coauthor_by_author
WHERE ...
In SELECT coauthor_name, MAX(num_of_colab), you are incorrectly assuming that the result of MAX(num_of_colab) corresponds to the coauthor_name. Aggregate functions will only ever return ONE row so the result set only ever contains one co-author. The co-author Abdul ... just happens to be the first row in the result so is listed with the MAX() output.
When using aggregate functions, it only makes sense to specify the function in the SELECT statement on its own:
SELECT function(col_name) FROM table WHERE ...
Specifying other columns in the query selectors is meaningless with aggregate functions. Cheers!
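If what you actually want is the co-author with the most collaborations, the usual Cassandra approach is to model for the query instead of aggregating. A sketch (table and column names are illustrative, not from the original question):

```sql
-- Sketch: cluster on colabs DESC so the top co-author is the first row.
CREATE TABLE coauthors_by_colabs (
    author text,
    colabs int,
    coauthor text,
    PRIMARY KEY (author, colabs, coauthor)
) WITH CLUSTERING ORDER BY (colabs DESC, coauthor ASC);

SELECT coauthor, colabs FROM coauthors_by_colabs
WHERE author = 'anita' LIMIT 1;
```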

Get the last 100 rows from cassandra table

I have a table in Cassandra and I cannot select the last 200 rows in the table.
The CLUSTERING ORDER BY clause was supposed to enforce sorting on disk.
CREATE TABLE t1(id int ,
event text,
receivetime timestamp ,
PRIMARY KEY (event, id)
) WITH CLUSTERING ORDER BY (id DESC)
;
The output is unsorted by id:
event | id | receivetime
---------+----+---------------------------------
event1 | 1 | 2021-07-12 08:11:57.702000+0000
event7 | 7 | 2021-05-22 05:30:00.000000+0000
event5 | 5 | 2021-05-25 05:30:00.000000+0000
event9 | 9 | 2021-05-22 05:30:00.000000+0000
event2 | 2 | 2021-05-21 05:30:00.000000+0000
event10 | 10 | 2021-05-23 05:30:00.000000+0000
event4 | 4 | 2021-05-24 05:30:00.000000+0000
event6 | 6 | 2021-05-27 05:30:00.000000+0000
event3 | 3 | 2021-05-22 05:30:00.000000+0000
event8 | 8 | 2021-05-21 05:30:00.000000+0000
How do I overcome this problem?
Thanks
The same question was asked on https://community.datastax.com/questions/11983/ so I'm re-posting my answer here.
The rows within a partition are sorted based on the order of the clustering column, not the partition key.
In your case, the table's primary key is defined as:
PRIMARY KEY (event, id)
This means that each partition key can have one or more rows, with each row identified by the id column. Since there is only one row in each partition, the sorting order is not evident. But if you had multiple rows in each partition, you'd be able to see that they would be sorted. For example:
event | id | receivetime
---------+----+---------------------------------
event1 | 7 | 2021-05-22 05:30:00.000000+0000
event1 | 5 | 2021-05-25 05:30:00.000000+0000
event1 | 1 | 2021-07-12 08:11:57.702000+0000
In the example above, the partition event1 has 3 rows sorted by the ID column in reverse order.
In addition, running unbounded queries (no WHERE clause filter) is an anti-pattern in Cassandra because it requires a full table scan. If you consider a cluster which has 500 nodes, an unbounded query has to request all the partitions (records) from all 500 nodes to return the result. It will not perform well and does not scale. Cheers!
The ordering for a clustering order is the order within a single partition key value, e.g. all of the rows for event1 would be in order for event1. It is not a global ordering.
From your results we can see you are selecting multiple partitions - which is why you are not seeing an order you expect.
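Putting those two points together, the "last N rows" can only be read per partition. A sketch against the table above (the event value is illustrative):

```sql
-- Within one partition, rows are already ordered by id DESC,
-- so LIMIT returns the highest (newest) ids for that event.
SELECT * FROM t1 WHERE event = 'event1' LIMIT 100;
```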

Query Cassandra with Both Primary Key and Secondary Key Constraints

I have a table in Cassandra defined as
CREATE TABLE foo ("A" text, "B" text, "C" text,
"D" text, "E" text, "F" text,
PRIMARY KEY ("A", "B"),
INDEX ("C"))
I inserted billions of records into this table. And now I want to query the table with CQL
SELECT * FROM foo WHERE "A"='abc' AND "B"='def' AND "C"='ghi'
I keep receiving 1200 error saying that
ReadTimeout: code=1200 [Coordinator node timed out waiting for replica
nodes' responses] message="Operation timed out - received only 0
responses." info={'received_responses': 0, 'required_responses': 1,
'consistency': 'ONE'}
After googling, I suspect the reason for this error is that the query is directed to some partitions that do not hold any data.
My questions are
Is there any constraint querying CQL with both primary key and secondary key specified?
If I specified the partition key in my CQL, here "A"='abc' (correct me if wrong), why C* still tries other partition that apparently does not hold the data?
Any hints to solve this timeout problem?
Thank you!
Note: For my examples, I got rid of the double-quotes around the column names. They really don't do anything other than preserve case in the column names (not the values), and just serve to muck-up the works.
Is there any constraint querying CQL with both primary key and secondary key specified?
First of all, I need to clear up what, exactly, your "primary key" and "secondary key" are. If you are referring to C as a "secondary key," then yes, you can, with some restrictions. If you mean your partition key (A) and your clustering key (B), then yes, you can.
Querying by your partition and clustering keys (or even just your partition key) works:
aploetz#cqlsh:stackoverflow2> SELECT * FROM foo WHERE A='abc' AND B='def';
a | b | c | d | e | f
-----+-----+-----+-----+-----+-----
abc | def | ghi | jkl | mno | pqr
(1 rows)
aploetz#cqlsh:stackoverflow2> SELECT * FROM foo WHERE A='abc';
a | b | c | d | e | f
-----+-----+-----+-----+-----+-----
abc | ddd | ghi | jkl | mno | pqr
abc | def | ghi | jkl | mno | pqr
(2 rows)
When I create your table and index, insert a few rows, and run your query:
aploetz#cqlsh:stackoverflow2> SELECT * FROM foo WHERE A='abc' AND B='def' AND C='ghi';
a | b | c | d | e | f
-----+-----+-----+-----+-----+-----
abc | def | ghi | jkl | mno | pqr
(1 rows)
That works.
If I specified the partition key in my CQL, here "A"='abc' (correct me if wrong), why C* still tries other partition that apparently does not hold the data?
I don't believe that is the problem. You are restricting it to a single partition, so it should only query data off of the abc partition.
I inserted billions of records into this table.
What you are seeing, is the reason that secondary index usage is considered to be an "anti-pattern" in Cassandra. Secondary indexes do not work the same way that they do in the relational world. They just do not scale well to large clusters or data sets.
Any hints to solve this timeout problem?
Yes. Recreate your table with C as a second clustering key. And do not create an index on C.
CREATE TABLE foo (A text, B text, C text, D text, E text, F text,
PRIMARY KEY (A, B, C));
Reload your data, and then this should work for you:
aploetz#cqlsh:stackoverflow2> SELECT * FROM foo WHERE A='abc' AND B='def' AND C='ghi';
Not only should it work, but it should not timeout and it should be fast.

Is there a way to make clustering order by data type and not string in Cassandra?

I created a table in CQL3 in the cqlsh using the following CQL:
CREATE TABLE test (
locationid int,
pulseid int,
name text, PRIMARY KEY(locationid, pulseid)
) WITH CLUSTERING ORDER BY (locationid ASC, pulseid DESC);
Note that locationid is an integer.
However, after I inserted data, and ran a select, I noticed that locationid's ascending sort seems to be based upon string, and not integer.
cqlsh:citypulse> select * from test;
locationid | pulseid | name
------------+---------+------
0 | 3 | test
0 | 2 | test
0 | 1 | test
0 | 0 | test
10 | 3 | test
5 | 3 | test
Note the 0 10 5. Is there a way to make it sort via its actual data type?
Thanks,
Allison
In Cassandra, the first part of the primary key is the 'partition key'. That key is used to distribute data around the cluster. It does this in a random fashion to achieve an even distribution. This means that you can not order by the first part of your primary key.
What version of Cassandra are you on? In the most recent version of 1.2 (1.2.2), the create statement you have used as an example is invalid.
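For reference, a version of that CREATE TABLE that is valid (a sketch) orders only on the clustering column, since the partition key cannot appear in CLUSTERING ORDER BY:

```sql
-- Sketch: only clustering columns may appear in CLUSTERING ORDER BY;
-- locationid is the partition key and is distributed, not sorted.
CREATE TABLE test (
    locationid int,
    pulseid int,
    name text,
    PRIMARY KEY (locationid, pulseid)
) WITH CLUSTERING ORDER BY (pulseid DESC);
```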
