How can I filter for a specific date on a CQL timestamp column? - cassandra

I have a table defined as:
CREATE TABLE downtime(
asset_code text,
down_start timestamp,
down_end timestamp,
down_duration duration,
down_type text,
down_reason text,
PRIMARY KEY ((asset_code, down_start), down_end)
);
I'd like to get downtime on a particular day, such as:
SELECT * FROM downtime \
WHERE asset_code = 'CA-PU-03-LB' \
AND todate(down_start) = '2022-12-11';
I got a syntax error:
SyntaxException: line 1:66 no viable alternative at input '(' (...where asset_code = 'CA-PU-03-LB' and [todate](...)
If functions are not allowed on a partition key in the WHERE clause, how can I get data with a "down_start" on a particular day?

You don't need to use the TODATE() function to filter for a specific date. You can simply specify the date as '2022-12-11' when applying a filter on a CQL timestamp column.
The catch is that you cannot use the equality operator (=), because the CQL timestamp data type is encoded as the number of milliseconds since the Unix epoch (Jan 1, 1970 00:00 GMT), so you need to be precise when working with timestamps.
Let me illustrate using this example table:
CREATE TABLE tstamps (
id int,
tstamp timestamp,
colour text,
PRIMARY KEY (id, tstamp)
)
My table contains the following sample data:
cqlsh> SELECT * FROM tstamps ;
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-05 11:25:01.000000+0000 | red
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
1 | 2022-12-07 01:48:07.870000+0000 | blue
1 | 2022-12-07 03:13:27.313000+0000 | indigo
The cqlsh client formats the tstamp column into a human-readable date in UTC. But really, the tstamp values are stored as integers:
cqlsh> SELECT tstamp, TOUNIXTIMESTAMP(tstamp) FROM tstamps ;
tstamp | system.tounixtimestamp(tstamp)
---------------------------------+--------------------------------
2022-12-05 11:25:01.000000+0000 | 1670239501000
2022-12-06 02:45:04.564000+0000 | 1670294704564
2022-12-06 11:06:48.119000+0000 | 1670324808119
2022-12-06 19:02:52.192000+0000 | 1670353372192
2022-12-07 01:48:07.870000+0000 | 1670377687870
2022-12-07 03:13:27.313000+0000 | 1670382807313
To retrieve the rows for a specific date, you need to specify the range of timestamps that fall on that date. For example, the timestamps for 6 Dec 2022 UTC range from 1670284800000 (2022-12-06 00:00:00.000 UTC) to 1670371199999 (2022-12-06 23:59:59.999 UTC).
This means if we want to query for December 6, we need to filter using a range query:
SELECT * FROM tstamps \
WHERE id = 1 \
AND tstamp >= '2022-12-06' \
AND tstamp < '2022-12-07';
and we get:
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
WARNING - In your case, where the timestamp column is part of the partition key, performing a range query is dangerous because it results in a multi-partition query: there are 86.4 million possible millisecond values between 1670284800000 and 1670371199999. For this reason, timestamps are not a good choice for partition keys.
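If you need to query by day regularly, the more idiomatic approach is to bucket the data by date so that one day maps to one partition. Here is a minimal sketch, assuming you can add a down_date column (it is not in your original schema) and populate it with the day portion of down_start at write time:

CREATE TABLE downtime_by_day (
asset_code text,
down_date date,
down_start timestamp,
down_end timestamp,
down_duration duration,
down_type text,
down_reason text,
PRIMARY KEY ((asset_code, down_date), down_start)
);

-- retrieving one day's downtime is now a single-partition read
SELECT * FROM downtime_by_day
WHERE asset_code = 'CA-PU-03-LB'
AND down_date = '2022-12-11';

Cheers!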

Related

Cassandra MAX function returning mismatched rows

Hi, I am trying to get the max co-author publication count from a table in Cassandra; however, it returns mismatched rows when I query:
select coauthor_name, MAX(num_of_colab) AS max_2020 from coauthor_by_author where pid = '40/2499' and year=2020;
It returns a row pairing a co-author name with a count of 9, which is wrong because 9 belongs to another co-author.
Here is my create statement for the table:
CREATE TABLE IF NOT EXISTS coauthor_by_author (
pid text,
year int,
coauthor_name text,
num_of_colab int,
PRIMARY KEY ((pid), year, coauthor_name, num_of_colab)
) WITH CLUSTERING ORDER BY (year desc);
As proof, here is part of the original table:
As you can see, Abdul Hanif Bin Zaini's number of publications as a co-author should only be 1.
The MAX() function is working as advertised, but I think your understanding of how it works is incorrect. Let me illustrate with an example.
Here is the schema for my table of authors:
CREATE TABLE authors_by_coauthor (
author text,
coauthor text,
colabs int,
PRIMARY KEY (author, coauthor)
)
Here is some sample data of authors, their corresponding co-authors, and the number of times they collaborated:
author | coauthor | colabs
---------+-----------+--------
edda | ramakanta | 5
edda | ruzica | 9
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
cassius | ceadda | 14
cassius | flaithri | 13
Anita has three co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'anita';
author | coauthor | colabs
--------+----------+--------
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
And the top number of collaborations for Anita is 12:
SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'anita';
system.max(colabs)
--------------------
12
Similarly, Cassius has two co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'cassius';
author | coauthor | colabs
---------+----------+--------
cassius | ceadda | 14
cassius | flaithri | 13
with 14 as the most collaborations:
cqlsh> SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'cassius';
system.max(colabs)
--------------------
14
Your question is incomplete since you haven't provided the full sample data, but I suspect you're expecting to get the name of the co-author with the most collaborations. This CQL query will NOT return the result you're after:
SELECT coauthor_name, MAX(num_of_colab)
FROM coauthor_by_author
WHERE ...
In SELECT coauthor_name, MAX(num_of_colab), you are incorrectly assuming that the result of MAX(num_of_colab) corresponds to the coauthor_name. Aggregate functions will only ever return ONE row so the result set only ever contains one co-author. The co-author Abdul ... just happens to be the first row in the result so is listed with the MAX() output.
When using aggregate functions, it only makes sense to specify the function in the SELECT statement on its own:
SELECT function(col_name) FROM table WHERE ...
Specifying other columns in the query selectors is meaningless with aggregate functions.
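If what you're actually after is the co-author with the most collaborations, the idiomatic Cassandra solution is to model the table so the clustering order answers the question, then take the first row. This is a sketch under that assumption (coauthor_by_colabs is a hypothetical new table, not your existing one):

CREATE TABLE coauthor_by_colabs (
pid text,
year int,
num_of_colab int,
coauthor_name text,
PRIMARY KEY ((pid, year), num_of_colab, coauthor_name)
) WITH CLUSTERING ORDER BY (num_of_colab DESC);

-- the row with the highest collaboration count comes back first
SELECT coauthor_name, num_of_colab
FROM coauthor_by_colabs
WHERE pid = '40/2499' AND year = 2020
LIMIT 1;

Cheers!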

Cassandra is unexpectedly prepending zeros in timestamp millisecond

My code is reading data from Kafka and writing it to Cassandra using Spark, but in some cases it is prepending a zero to the milliseconds.
For example:
Kafka data: 2022-10-11T08:46:12.220Z
Cassandra data: 2022-10-11 14:16:12.022000+0000
Another example: where we expect 2022-07-31 23:28:46.960000+0000, Cassandra has 2022-07-31 23:28:46.096000+0000.
How is a zero getting prepended to the milliseconds, and how do I resolve it? It only happens in some cases; most of the timestamps come through correctly.
Note: the difference in hours and minutes is due to a timezone change.
I suspect you're looking at different records because the timestamps you posted are not the same as what you think they should be.
I happen to have a table that contains timestamps, so I inserted the same timestamps you posted above, and I can confirm that Cassandra is not adding leading zeros.
Here is my table schema:
CREATE TABLE community.tstamp_table (
id int,
tstamp timestamp,
name text,
PRIMARY KEY (id, tstamp)
)
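I populated it with statements along these lines (the timestamp literals are copied from your examples):

INSERT INTO community.tstamp_table (id, tstamp, name) VALUES (1, '2022-10-11 14:16:12.022+0000', 'alice');
INSERT INTO community.tstamp_table (id, tstamp, name) VALUES (1, '2022-10-11 14:16:12.220+0000', 'alice');
INSERT INTO community.tstamp_table (id, tstamp, name) VALUES (2, '2022-07-31 23:28:46.096+0000', 'bob');
INSERT INTO community.tstamp_table (id, tstamp, name) VALUES (2, '2022-07-31 23:28:46.960+0000', 'bob');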
And here are the table's contents with the timestamps you posted:
id | tstamp | name
----+---------------------------------+-------
1 | 2022-10-11 14:16:12.022000+0000 | alice
1 | 2022-10-11 14:16:12.220000+0000 | alice
2 | 2022-07-31 23:28:46.096000+0000 | bob
2 | 2022-07-31 23:28:46.960000+0000 | bob
The CQL timestamp data type is encoded as the number of milliseconds since Unix epoch (Jan 1, 1970 00:00 GMT). Knowing this, we can display the timestamp as an integer value using some native CQL functions:
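The query uses the native blobAsBigint() and timestampAsBlob() functions (reconstructed here from the column header in the output below):

SELECT blobAsBigint(timestampAsBlob(tstamp)), tstamp FROM community.tstamp_table;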
system.blobasbigint(system.timestampasblob(tstamp)) | tstamp
-----------------------------------------------------+---------------------------------
1665497772022 | 2022-10-11 14:16:12.022000+0000
1665497772220 | 2022-10-11 14:16:12.220000+0000
1659310126096 | 2022-07-31 23:28:46.096000+0000
1659310126960 | 2022-07-31 23:28:46.960000+0000
You should be able to see from the sample data above that the encoded values of the timestamps are correct. For example, the first pair of rows is correctly encoded as 022 ms vs 220 ms, and the second pair as 096 ms vs 960 ms. Cheers!

Cassandra: cannot restrict 2 columns using clustering key(version 2.1.9)

I have a schema pretty similar to this:
create table x(id int, start_date timestamp, end_date timestamp,
primary key((id), start_date, end_date))
with clustering order by (start_date desc, end_date desc);
Now I am stuck with a problem where I have to query between the start date and end date, something like this:
select count(*) from x where id=2 and start_date > 'date' and end_date < 'date' ;
But it gives me an error similar to the following:
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column "end_date"
cannot be restricted (preceding column "start_date" is restricted
by a non-EQ relation)"
I am new to Cassandra; any and all suggestions are welcome, even if they require a schema change. :)
You mention you're running 2.1.9; in 2.2 and later you can do multi-column slice restrictions on clustering columns, which can get close to what you want. The syntax in CQL is a little ugly: you basically specify the start and end of the range with all the clustering columns given, like a compound key. It's important to think of the rows as sorted first by the first column, then within that by the second column.
So assume we have this data:
SELECT * from x;
id | start_date | end_date
----+--------------------------+--------------------------
2 | 2015-09-01 09:16:47+0000 | 2015-11-01 09:16:47+0000
2 | 2015-08-01 09:16:47+0000 | 2015-10-01 09:16:47+0000
2 | 2015-07-01 09:16:47+0000 | 2015-09-01 09:16:47+0000
2 | 2015-06-01 09:16:47+0000 | 2015-10-01 09:16:47+0000
Now let's select based on both dates:
SELECT * from x where id=2
and (start_date,end_date) >= ('2015-07-01 09:16:47+0000','2015-07-01 09:16:47+0000')
and (start_date,end_date) <= ('2015-09-01 09:16:47+0000','2015-09-01 09:16:47+0000');
id | start_date | end_date
----+--------------------------+--------------------------
2 | 2015-08-01 09:16:47+0000 | 2015-10-01 09:16:47+0000
2 | 2015-07-01 09:16:47+0000 | 2015-09-01 09:16:47+0000
Now you'll notice that one of those end dates appears to be later than our restriction. That happens because the restriction applies to the (start_date, end_date) tuple as a whole, not to each column independently: since rows are sorted by start_date first, any end_date will match as long as its start_date is strictly inside the range. To get rid of rows like that, you'll probably need to do a little filtering on the client side.
See more information here, under "Multi-column slice restrictions".
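As a side note: on much newer versions of Cassandra (3.6 and later allow filtering on clustering columns), you could instead push the second predicate to the server with ALLOW FILTERING, at the cost of scanning the whole partition. This is a sketch of that alternative, not something 2.1.9 supports:

SELECT COUNT(*) FROM x
WHERE id = 2
AND start_date > '2015-07-01 09:16:47+0000'
AND end_date < '2015-09-01 09:16:47+0000'
ALLOW FILTERING;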

cassandra composite index and compact storages

I am new to Cassandra and have not run it yet, but my business logic requires creating a table like this.
CREATE TABLE Index(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, keyword, score); )
WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
Is it possible or not? I have only one column (fID) which is not part of my composite index, so I hope I will be able to apply the COMPACT STORAGE setting. Pay attention that I ordered by the third column of my composite index, not the second. I need to compact the storage as well, so the keywords will not be repeated for each fID.
A few things initially about your CREATE TABLE statement:
It will error on the semicolon (;) after your PRIMARY KEY definition.
You will need to pick a new name, as Index is a reserved word.
Pay attention that I ordered by the third column of my composite index, not the second.
You cannot skip a clustering key when you specify CLUSTERING ORDER.
However, I do see an option here. Depending on your query requirements, you could simply re-order keyword and score in your PRIMARY KEY definition, and then it would work:
CREATE TABLE giveMeABetterName(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, score, keyword)
) WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
That way, you could query by user_id and your rows (keywords?) for that user would be ordered by score:
SELECT * FROM giveMeABetterName WHERE user_id=1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4;
If that won't work for your business logic, then you might have to retouch your data model. But it is not possible to skip a clustering key when specifying CLUSTERING ORDER.
Edit
But re-ordering of columns does not work for me. Can I do something like this: WITH CLUSTERING ORDER BY (keyword asc, score desc)?
Let's look at some options here. I created a table with your original PRIMARY KEY, but with this CLUSTERING ORDER. That will technically work, but look at how it treats my sample data (video game keywords):
aploetz@cqlsh:stackoverflow> SELECT * FROM givemeabettername WHERE user_id=dbeddd12-40c9-4f84-8c41-162dfb93a69f;
user_id | keyword | score | fid
--------------------------------------+------------------+-------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Assassin's creed | 87 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Battlefield 4 | 9 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Uncharted 2 | 91 | 0
(3 rows)
On the other hand, if I alter the PRIMARY KEY to cluster on score first (and adjust CLUSTERING ORDER accordingly), the same query returns this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
Note that score still sorts "ASCII-betically" here (9 lands between 91 and 87) because it is a TEXT column. You'll want to change its data type to a numeric (int/bigint) to get true numeric ordering, like this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
Something that might help you is to read through this DataStax doc on Compound Keys and Clustering.

Is there a way to make clustering order by data type and not string in Cassandra?

I created a table in CQL3 in the cqlsh using the following CQL:
CREATE TABLE test (
locationid int,
pulseid int,
name text, PRIMARY KEY(locationid, pulseid)
) WITH CLUSTERING ORDER BY (locationid ASC, pulseid DESC);
Note that locationid is an integer.
However, after I inserted data, and ran a select, I noticed that locationid's ascending sort seems to be based upon string, and not integer.
cqlsh:citypulse> select * from test;
locationid | pulseid | name
------------+---------+------
0 | 3 | test
0 | 2 | test
0 | 1 | test
0 | 0 | test
10 | 3 | test
5 | 3 | test
Note the 0, 10, 5 ordering. Is there a way to make it sort according to its actual data type?
Thanks,
Allison
In Cassandra, the first part of the primary key is the 'partition key'. That key is used to distribute data around the cluster, and it does so in a random fashion to achieve an even distribution. This means that you cannot order by the first part of your primary key.
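You can see this for yourself with the token() function; partition keys come back in token order, which is why 0, 10, 5 looks arbitrary. For example (the token values depend on your partitioner):

SELECT token(locationid), locationid, pulseid FROM test;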
What version of Cassandra are you on? In the most recent version of 1.2 (1.2.2), the create statement you have used as an example is invalid.
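For reference, a valid version of that create statement would reference only clustering columns in the CLUSTERING ORDER clause, something like:

CREATE TABLE test (
locationid int,
pulseid int,
name text,
PRIMARY KEY (locationid, pulseid)
) WITH CLUSTERING ORDER BY (pulseid DESC);

Even then, rows are only ordered by pulseid within a single locationid partition; the partitions themselves remain in token order.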
