Cassandra: cannot restrict 2 columns using clustering key (version 2.1.9)

I have a schema pretty similar to this:
create table x (
    id int,
    start_date timestamp,
    end_date timestamp,
    primary key ((id), start_date, end_date)
) with clustering order by (start_date desc, end_date desc);
Now I am stuck with a problem where I have to query between a start date and an end date, something like this:
select count(*) from x where id=2 and start_date > 'date' and end_date < 'date' ;
But it gives me an error similar to the following:
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column "end_date"
cannot be restricted (preceding column "start_date" is restricted
by a non-EQ relation)"
I am new to Cassandra; any and all suggestions are welcome, even if they require a schema change. :)

You don't say which version of Cassandra you are running, but in 2.2 and later you can do multi-column slice restrictions on clustering columns. This can get close to what you want. The syntax in CQL is a little ugly, but basically you specify each end of the range with all the clustering columns, like a compound key. It's important to remember that rows are sorted first by the first clustering column, and then within that by the second.
So assume we have this data:
SELECT * from x;
id | start_date | end_date
----+--------------------------+--------------------------
2 | 2015-09-01 09:16:47+0000 | 2015-11-01 09:16:47+0000
2 | 2015-08-01 09:16:47+0000 | 2015-10-01 09:16:47+0000
2 | 2015-07-01 09:16:47+0000 | 2015-09-01 09:16:47+0000
2 | 2015-06-01 09:16:47+0000 | 2015-10-01 09:16:47+0000
Now let's select based on both dates:
SELECT * from x where id=2
and (start_date,end_date) >= ('2015-07-01 09:16:47+0000','2015-07-01 09:16:47+0000')
and (start_date,end_date) <= ('2015-09-01 09:16:47+0000','2015-09-01 09:16:47+0000');
id | start_date | end_date
----+--------------------------+--------------------------
2 | 2015-08-01 09:16:47+0000 | 2015-10-01 09:16:47+0000
2 | 2015-07-01 09:16:47+0000 | 2015-09-01 09:16:47+0000
Now you'll notice that one of those end dates is later than our upper bound, yet the row is still returned. Since rows are sorted by start_date first, the comparison is on the compound (start_date, end_date) value, so every row whose start_date falls in the range comes back regardless of its end_date. To get rid of rows like that you'll probably need to do a little filtering on the client side.
See more information here, under "Multi-column slice restrictions".
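As a side note (not part of the original answer): if you eventually upgrade to Cassandra 3.6 or later, clustering columns can, as far as I recall, also be restricted out of order by adding ALLOW FILTERING, which would let you write the original query directly. A rough sketch, with the dates as placeholders:
SELECT count(*) FROM x
WHERE id = 2
  AND start_date > '2015-07-01 09:16:47+0000'
  AND end_date < '2015-09-01 09:16:47+0000'
ALLOW FILTERING;
Keep in mind the filtering on end_date still happens server-side, row by row, so this is only reasonable for partitions of modest size.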

Related

How can I filter for a specific date on a CQL timestamp column?

I have a table defined as:
CREATE TABLE downtime(
asset_code text,
down_start timestamp,
down_end timestamp,
down_duration duration,
down_type text,
down_reason text,
PRIMARY KEY ((asset_code, down_start), down_end)
);
I'd like to get downtime on a particular day, such as:
SELECT * FROM downtime \
WHERE asset_code = 'CA-PU-03-LB' \
AND todate(down_start) = '2022-12-11';
I got a syntax error:
SyntaxException: line 1:66 no viable alternative at input '(' (...where asset_code = 'CA-PU-03-LB' and [todate](...)
If a function is not allowed on a partition key in the WHERE clause, how can I get data with down_start on a particular day?
You don't need to use the TODATE() function to filter for a specific date. You can simply specify the date as '2022-12-11' when applying a filter on a CQL timestamp column.
But the difference is that you cannot use the equality operator (=), because the CQL timestamp data type is encoded as the number of milliseconds since the Unix epoch (Jan 1, 1970 00:00 GMT), so you need to be precise when working with timestamps.
Let me illustrate using this example table:
CREATE TABLE tstamps (
id int,
tstamp timestamp,
colour text,
PRIMARY KEY (id, tstamp)
);
My table contains the following sample data:
cqlsh> SELECT * FROM tstamps ;
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-05 11:25:01.000000+0000 | red
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
1 | 2022-12-07 01:48:07.870000+0000 | blue
1 | 2022-12-07 03:13:27.313000+0000 | indigo
The cqlsh client formats the tstamp column into a human-readable date in UTC. But really, the tstamp values are stored as integers:
cqlsh> SELECT tstamp, TOUNIXTIMESTAMP(tstamp) FROM tstamps ;
tstamp | system.tounixtimestamp(tstamp)
---------------------------------+--------------------------------
2022-12-05 11:25:01.000000+0000 | 1670239501000
2022-12-06 02:45:04.564000+0000 | 1670294704564
2022-12-06 11:06:48.119000+0000 | 1670324808119
2022-12-06 19:02:52.192000+0000 | 1670353372192
2022-12-07 01:48:07.870000+0000 | 1670377687870
2022-12-07 03:13:27.313000+0000 | 1670382807313
To retrieve the rows for a specific date, you need to specify the range of timestamps which fall on that date. For example, the timestamps for 6 Dec 2022 UTC range from 1670284800000 (2022-12-06 00:00:00.000 UTC) to 1670371199999 (2022-12-06 23:59:59.999 UTC).
This means if we want to query for December 6, we need to filter using a range query:
SELECT * FROM tstamps \
WHERE id = 1 \
AND tstamp >= '2022-12-06' \
AND tstamp < '2022-12-07';
and we get:
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
WARNING - In your case where the timestamp column is part of the partition key, performing a range query is dangerous because it results in a multi-partition query -- there are 86M possible values between 1670284800000 and 1670371199999. For this reason, timestamps are not a good choice for partition keys. Cheers!
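If you do need efficient lookups by day, one common alternative (sketched here as an assumption about your access pattern, not something from the original question) is to bucket the partition key by day and keep the full timestamp as a clustering column:
-- hypothetical day-bucketed variant of the downtime table
CREATE TABLE downtime_by_day (
    asset_code text,
    down_day date,           -- e.g. '2022-12-11', computed by the application at write time
    down_start timestamp,
    down_end timestamp,
    down_duration duration,
    down_type text,
    down_reason text,
    PRIMARY KEY ((asset_code, down_day), down_start)
);

-- all downtime for one asset on one day is then a single-partition query
SELECT * FROM downtime_by_day
WHERE asset_code = 'CA-PU-03-LB'
  AND down_day = '2022-12-11';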

Get the last 100 rows from a Cassandra table

I have a table in Cassandra and I cannot select the last 200 rows in the table.
The clustering order by clause was supposed to enforce sorting on disk.
CREATE TABLE t1 (
    id int,
    event text,
    receivetime timestamp,
    PRIMARY KEY (event, id)
) WITH CLUSTERING ORDER BY (id DESC);
The output is unsorted by id:
event | id | receivetime
---------+----+---------------------------------
event1 | 1 | 2021-07-12 08:11:57.702000+0000
event7 | 7 | 2021-05-22 05:30:00.000000+0000
event5 | 5 | 2021-05-25 05:30:00.000000+0000
event9 | 9 | 2021-05-22 05:30:00.000000+0000
event2 | 2 | 2021-05-21 05:30:00.000000+0000
event10 | 10 | 2021-05-23 05:30:00.000000+0000
event4 | 4 | 2021-05-24 05:30:00.000000+0000
event6 | 6 | 2021-05-27 05:30:00.000000+0000
event3 | 3 | 2021-05-22 05:30:00.000000+0000
event8 | 8 | 2021-05-21 05:30:00.000000+0000
How do I overcome this problem?
Thanks
The same question was asked on https://community.datastax.com/questions/11983/ so I'm re-posting my answer here.
The rows within a partition are sorted based on the order of the clustering column, not the partition key.
In your case, the table's primary key is defined as:
PRIMARY KEY (event, id)
This means that each partition key can have one or more rows, with each row identified by the id column. Since there is only one row in each partition, the sorting order is not evident. But if you had multiple rows in each partition, you'd be able to see that they would be sorted. For example:
event | id | receivetime
---------+----+---------------------------------
event1 | 7 | 2021-05-22 05:30:00.000000+0000
event1 | 5 | 2021-05-25 05:30:00.000000+0000
event1 | 1 | 2021-07-12 08:11:57.702000+0000
In the example above, the partition event1 has 3 rows sorted by the ID column in reverse order.
In addition, running unbounded queries (no WHERE clause filter) is an anti-pattern in Cassandra because it requires a full table scan. If you consider a cluster which has 500 nodes, an unbounded query has to request all the partitions (records) from all 500 nodes to return the result. It will not perform well and does not scale. Cheers!
The ordering defined by a clustering order is the order within a single partition key value; e.g. all of the rows for event1 would be in order within event1. It is not a global ordering.
From your results we can see you are selecting multiple partitions, which is why you are not seeing the order you expect.
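To actually use the clustering order, the query has to stay within one partition. A minimal sketch against the existing schema (the event name and the limit of 100 are just illustrative values):
-- rows come back ordered by id DESC thanks to the CLUSTERING ORDER
SELECT * FROM t1 WHERE event = 'event1' LIMIT 100;
Getting the last 100 rows across all events would need a different data model, for example a table whose partition key is a fixed or time-based bucket so that the rows you want live in one partition.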

Time varies between Postgres server and Excel

I am trying a query which groups the data by months.
test_db=# select date_trunc('month', install_ts) AS month, count(id) AS count from api_booking group by month order by month asc;
month | count
------------------------+-------
2016-08-01 00:00:00+00 | 297
2016-09-01 00:00:00+00 | 2409
2016-10-01 00:00:00+00 | 2429
2016-11-01 00:00:00+00 | 3512
(4 rows)
This is the output in my Postgres db shell.
However, when I try this query in Excel, this is the output:
month | count
------------------------+-------
2016-07-31 17:00:00+00 | 297
2016-08-31 17:00:00+00 | 2409
2016-09-30 17:00:00+00 | 2429
2016-10-31 17:00:00+00 | 3512
(4 rows)
The problem, I think, is that Excel is interpreting the date in a different timezone.
So, how can I tell Excel to read it correctly?
Or is there any other solution to this problem?
Try...
select date(date_trunc('month', install_ts)) AS month, count(id) AS count from api_booking
The date() function strips out the time portion from a timestamp, so Excel receives a plain date with no timezone to shift.
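Putting it together with the original query, something like this (a sketch, assuming the same api_booking table) should return plain dates that Excel will not shift:
select date(date_trunc('month', install_ts)) AS month, count(id) AS count
from api_booking
group by month
order by month asc;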

Cassandra composite index and compact storage

I am new to Cassandra and have not run it yet, but my business logic requires creating a table like this.
CREATE TABLE Index(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, keyword, score); )
WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
Is it possible or not? I have only one column (fID) which is not part of my composite index, so I hope I will be able to apply the COMPACT STORAGE setting. Pay attention that I ordered by the third column of my composite index, not the second. I need compact storage as well, so that the keywords will not be repeated for each fID.
A few things initially about your CREATE TABLE statement:
It will error on the semicolon (;) after your PRIMARY KEY definition.
You will need to pick a new name, as Index is a reserved word.
Pay attention that I ordered by the third column of my composite index, not the second.
You cannot skip a clustering key when you specify CLUSTERING ORDER.
However, I do see an option here. Depending on your query requirements, you could simply re-order keyword and score in your PRIMARY KEY definition, and then it would work:
CREATE TABLE giveMeABetterName(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, score, keyword)
) WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
That way, you could query by user_id and your rows (keywords?) for that user would be ordered by score:
SELECT * FROM giveMeABetterName WHERE user_id=1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4;
If that won't work for your business logic, then you might have to retouch your data model. But it is not possible to skip a clustering key when specifying CLUSTERING ORDER.
Edit
But re-ordering of columns does not work for me. Can I do something like this: WITH CLUSTERING ORDER BY (keyword asc, score desc)?
Let's look at some options here. I created a table with your original PRIMARY KEY, but with this CLUSTERING ORDER. That will technically work, but look at how it treats my sample data (video game keywords):
aploetz#cqlsh:stackoverflow> SELECT * FROM givemeabettername WHERE user_id=dbeddd12-40c9-4f84-8c41-162dfb93a69f;
user_id | keyword | score | fid
--------------------------------------+------------------+-------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Assassin's creed | 87 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Battlefield 4 | 9 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Uncharted 2 | 91 | 0
(3 rows)
On the other hand, if I alter the PRIMARY KEY to cluster on score first (and adjust CLUSTERING ORDER accordingly), the same query returns this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
Note that score is still TEXT here, so the values sort ASCII-betically ("9" lands between "91" and "87"). You'll want to change the data type of score from TEXT to a numeric type (int/bigint) to get proper numeric sorting, like this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
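A sketch of what that adjusted table could look like, combining the re-ordered clustering key with a numeric score (the table name is the placeholder from above, and COMPACT STORAGE is left out here for brevity):
CREATE TABLE giveMeABetterName (
    user_id uuid,
    score int,
    keyword text,
    fID int,
    PRIMARY KEY (user_id, score, keyword)
) WITH CLUSTERING ORDER BY (score DESC, keyword ASC);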
Something that might help you, is to read through this DataStax doc on Compound Keys and Clustering.

Cassandra Delete by Secondary Index or By Allowing Filtering

I’m trying to delete by a secondary index or column key in a table. I'm not concerned with performance as this will be an unusual query. Not sure if it’s possible? E.g.:
CREATE TABLE user_range (
id int,
name text,
end int,
start int,
PRIMARY KEY (id, name)
)
cqlsh> select * from dat.user_range where id=774516966;
id | name | end | start
-----------+-----------+-----+-------
774516966 | 0 - 499 | 499 | 0
774516966 | 500 - 999 | 999 | 500
I can:
cqlsh> select * from dat.user_range where name='1000 - 1999' allow filtering;
id | name | end | start
-------------+-------------+------+-------
-285617516 | 1000 - 1999 | 1999 | 1000
-175835205 | 1000 - 1999 | 1999 | 1000
-1314399347 | 1000 - 1999 | 1999 | 1000
-1618174196 | 1000 - 1999 | 1999 | 1000
Blah blah…
But I can’t delete:
cqlsh> delete from dat.user_range where name='1000 - 1999' allow filtering;
Bad Request: line 1:52 missing EOF at 'allow'
cqlsh> delete from dat.user_range where name='1000 - 1999';
Bad Request: Missing mandatory PRIMARY KEY part id
Even if I create an index:
cqlsh> create index on dat.user_range (start);
cqlsh> delete from dat.user_range where start=1000;
Bad Request: Non PRIMARY KEY start found in where clause
Is it possible to delete without first knowing the primary key?
No, deleting by using a secondary index is not supported: CASSANDRA-5527
When you have a secondary index, you can select all matching rows from that index. Once you have the rows, you know their primary keys and can then delete them.
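In practice that is a two-step pattern along these lines (using the index on start and values taken from the question's own output):
-- 1. use the index to find the primary keys
SELECT id, name FROM dat.user_range WHERE start = 1000;

-- 2. delete each returned row by its full primary key
DELETE FROM dat.user_range WHERE id = -285617516 AND name = '1000 - 1999';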
I came here looking for a solution to delete rows from a Cassandra column family.
I ended up doing an INSERT with a TTL (time to live) set, so that I don't have to worry about deleting the rows at all.
Putting it out there, might help someone.
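For reference, the TTL is set at write time; a small sketch against the table above (the 86400 seconds, i.e. one day, is an arbitrary choice):
INSERT INTO dat.user_range (id, name, start, end)
VALUES (774516966, '0 - 499', 0, 499)
USING TTL 86400;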
