Cassandra MAX function returning mismatched rows - cassandra

Hi, I am trying to get the maximum co-author publication count from a table in Cassandra, but it returns mismatched rows when I run this query:
select coauthor_name, MAX(num_of_colab) AS max_2020 from coauthor_by_author where pid = '40/2499' and year=2020;
It returns:
which is wrong because 9 belongs to another coauthor.
Here is my create statement for the table:
CREATE TABLE IF NOT EXISTS coauthor_by_author (
pid text,
year int,
coauthor_name text,
num_of_colab int,
PRIMARY KEY ((pid), year, coauthor_name, num_of_colab)
) WITH CLUSTERING ORDER BY (year desc);
As proof, here is part of the original table:
As you can see, the number of publications for Abdul Hanif Bin Zaini as co-author should only be 1.

The MAX() function is working as advertised but I think your understanding of how it works is incorrect. Let me illustrate with an example.
Here is the schema for my table of authors:
CREATE TABLE authors_by_coauthor (
author text,
coauthor text,
colabs int,
PRIMARY KEY (author, coauthor)
)
Here is some sample data of authors, their corresponding co-authors, and the number of times they collaborated:
author | coauthor | colabs
---------+-----------+--------
edda | ramakanta | 5
edda | ruzica | 9
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
cassius | ceadda | 14
cassius | flaithri | 13
Anita has three co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'anita';
author | coauthor | colabs
--------+----------+--------
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
And the top number of collaborations for Anita is 12:
SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'anita';
system.max(colabs)
--------------------
12
Similarly, Cassius has two co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'cassius';
author | coauthor | colabs
---------+----------+--------
cassius | ceadda | 14
cassius | flaithri | 13
with 14 as the most collaborations:
cqlsh> SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'cassius';
system.max(colabs)
--------------------
14
Your question is incomplete since you haven't provided the full sample data but I suspect you're expecting to get the name of the co-author with the most collaborations. This CQL query will NOT return the result you're after:
SELECT coauthor_name, MAX(num_of_colab)
FROM coauthor_by_author
WHERE ...
In SELECT coauthor_name, MAX(num_of_colab), you are incorrectly assuming that the result of MAX(num_of_colab) corresponds to the coauthor_name. Without a GROUP BY clause, an aggregate function only ever returns ONE row, so the result set only ever contains one co-author. The co-author Abdul ... just happens to be the first row in the result, so it is listed alongside the MAX() output.
When using aggregate functions, it only makes sense to specify the function in the SELECT statement on its own:
SELECT function(col_name) FROM table WHERE ...
Specifying other columns in the query selectors is meaningless with aggregate functions. Cheers!
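If what you want is the name of the co-author together with the maximum, one option is to fetch the rows for the partition and pick the largest one on the client side. Here is a minimal sketch in plain Python using the sample rows for Anita from the example above. The rows list and the top_coauthor helper are made up for illustration and stand in for whatever your driver returns:

```python
# Rows as (author, coauthor, colabs) tuples, copied from the sample data above.
rows = [
    ("anita", "dakarai", 8),
    ("anita", "sophus", 12),
    ("anita", "uche", 4),
]

def top_coauthor(rows):
    """Return (coauthor, colabs) for the row with the most collaborations."""
    best = max(rows, key=lambda row: row[2])
    return best[1], best[2]

print(top_coauthor(rows))  # ('sophus', 12)
```

This keeps the name and the count from the same row, which the mixed SELECT cannot guarantee.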

Related

How can I filter for a specific date on a CQL timestamp column?

I have a table defined as:
CREATE TABLE downtime(
asset_code text,
down_start timestamp,
down_end timestamp,
down_duration duration,
down_type text,
down_reason text,
PRIMARY KEY ((asset_code, down_start), down_end)
);
I'd like to get downtime on a particular day, such as:
SELECT * FROM downtime \
WHERE asset_code = 'CA-PU-03-LB' \
AND todate(down_start) = '2022-12-11';
I got a syntax error:
SyntaxException: line 1:66 no viable alternative at input '(' (...where asset_code = 'CA-PU-03-LB' and [todate](...)
If a function is not allowed on a partition key in the WHERE clause, how can I get the rows whose "down_start" falls on a particular day?
You don't need to use the TODATE() function to filter for a specific date. You can simply specify the date as '2022-12-11' when applying a filter on a CQL timestamp column.
The difference is that you cannot use the equality operator (=): the CQL timestamp data type is encoded as the number of milliseconds since the Unix epoch (Jan 1, 1970 00:00 GMT), so you need to be precise when you're working with timestamps.
Let me illustrate using this example table:
CREATE TABLE tstamps (
id int,
tstamp timestamp,
colour text,
PRIMARY KEY (id, tstamp)
)
My table contains the following sample data:
cqlsh> SELECT * FROM tstamps ;
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-05 11:25:01.000000+0000 | red
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
1 | 2022-12-07 01:48:07.870000+0000 | blue
1 | 2022-12-07 03:13:27.313000+0000 | indigo
The cqlsh client formats the tstamp column into a human-readable date in UTC. But really, the tstamp values are stored as integers:
cqlsh> SELECT tstamp, TOUNIXTIMESTAMP(tstamp) FROM tstamps ;
tstamp | system.tounixtimestamp(tstamp)
---------------------------------+--------------------------------
2022-12-05 11:25:01.000000+0000 | 1670239501000
2022-12-06 02:45:04.564000+0000 | 1670294704564
2022-12-06 11:06:48.119000+0000 | 1670324808119
2022-12-06 19:02:52.192000+0000 | 1670353372192
2022-12-07 01:48:07.870000+0000 | 1670377687870
2022-12-07 03:13:27.313000+0000 | 1670382807313
To retrieve the rows for a specific date, you need to specify the range of timestamps which fall on that date. For example, the timestamps for 6 Dec 2022 UTC range from 1670284800000 (2022-12-06 00:00:00.000 UTC) to 1670371199999 (2022-12-06 23:59:59.999 UTC).
This means if we want to query for December 6, we need to filter using a range query:
SELECT * FROM tstamps \
WHERE id = 1 \
AND tstamp >= '2022-12-06' \
AND tstamp < '2022-12-07';
and we get:
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
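If you build the range in application code, the day boundaries can be computed rather than typed by hand. Here is a minimal sketch in plain Python, assuming the day is meant in UTC. The day_bounds_ms helper is a made-up name, and the upper bound is exclusive to match the "<" in the range query:

```python
from datetime import date, datetime, timedelta, timezone

def day_bounds_ms(day):
    """Return (start, end) in epoch milliseconds for a UTC day; end is exclusive."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

start_ms, end_ms = day_bounds_ms(date(2022, 12, 6))
print(start_ms, end_ms)  # 1670284800000 1670371200000
```

The two values can then be bound to the >= and < conditions of the query.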
WARNING - In your case, where the timestamp column is part of the partition key, performing a range query is dangerous because it results in a multi-partition query: there are 86.4 million possible values between 1670284800000 and 1670371199999. For this reason, timestamps are not a good choice for partition keys. Cheers!

COGNOS Report: COUNT IF

I am not sure how to go about creating a custom field to count instances given a condition.
I have a field, ID, that exists in two formats:
A#####
B#####
I would like to create two columns (one for A and one for B) and count instances by month. Something like COUNTIF ID STARTS WITH A for the first column resulting in something like below. Right now I can only create a table with the total count.
+-------+------+------+
| Month | ID A | ID B |
+-------+------+------+
| Jan | 100 | 10 |
+-------+------+------+
| Feb | 130 | 13 |
+-------+------+------+
| Mar | 90 | 12 |
+-------+------+------+
Define ID A as...
CASE
WHEN ID LIKE 'A%' THEN 1
ELSE 0
END
...and set the Default aggregation property to Total.
Do the same for ID B.
Apologies if I misunderstood the requirement, but you may be able to spin the list into a crosstab using the section off the toolbar. Your measure value would be count(ID).
Try this
Query 1 to count A, filtering by substring(ID,1,1) = 'A'
Query 2 to count B, filtering by substring(ID,1,1) = 'B'
Join Query 1 and Query 2 by Year/Month
List by Month with Count A and Count B
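To make the logic of these answers concrete outside Cognos, here is a minimal sketch in plain Python of the same conditional count: each row contributes 1 to the column that matches the first character of its ID, and the totals are grouped by month. The records list and the count_by_prefix helper are made up for illustration and are not Cognos syntax:

```python
from collections import defaultdict

# Hypothetical (month, ID) records standing in for the report's query rows.
records = [
    ("Jan", "A00001"),
    ("Jan", "B00001"),
    ("Jan", "A00002"),
    ("Feb", "A00003"),
    ("Feb", "B00002"),
]

def count_by_prefix(records):
    """Count IDs per month, split by whether the ID starts with A or B."""
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for month, record_id in records:
        prefix = record_id[:1]
        if prefix in ("A", "B"):
            counts[month][prefix] += 1
    return dict(counts)

print(count_by_prefix(records))
```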

cassandra composite index and compact storages

I am new to Cassandra and have not run it yet, but my business logic requires me to create a table like this:
CREATE TABLE Index(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, keyword, score); )
WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
Is it possible or not? I have only one column (fID) which is not part of my composite index, so I hope I will be able to apply the compact storage setting. Pay attention that I ordered by the third column of my composite index, not the second. I need to compact the storage as well, so the keywords will not be repeated for each fID.
A few things initially about your CREATE TABLE statement:
It will error on the semicolon (;) after your PRIMARY KEY definition.
You will need to pick a new name, as Index is a reserved word.
Pay attention that I ordered by the third column of my composite index, not the second.
You cannot skip a clustering key when you specify CLUSTERING ORDER.
However, I do see an option here. Depending on your query requirements, you could simply re-order keyword and score in your PRIMARY KEY definition, and then it would work:
CREATE TABLE giveMeABetterName(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, score, keyword)
) WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
That way, you could query by user_id and your rows (keywords?) for that user would be ordered by score:
SELECT * FROM giveMeABetterName WHERE user_id=1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4;
If that won't work for your business logic, then you might have to retouch your data model. But it is not possible to skip a clustering key when specifying CLUSTERING ORDER.
Edit
But re-ordering of columns does not work for me. Can I do something like this: WITH CLUSTERING ORDER BY (keyword asc, score desc)?
Let's look at some options here. I created a table with your original PRIMARY KEY, but with this CLUSTERING ORDER. That will technically work, but look at how it treats my sample data (video game keywords):
aploetz#cqlsh:stackoverflow> SELECT * FROM givemeabettername WHERE user_id=dbeddd12-40c9-4f84-8c41-162dfb93a69f;
user_id | keyword | score | fid
--------------------------------------+------------------+-------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Assassin's creed | 87 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Battlefield 4 | 9 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Uncharted 2 | 91 | 0
(3 rows)
On the other hand, if I alter the PRIMARY KEY to cluster on score first (and adjust CLUSTERING ORDER accordingly), the same query returns this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
Note that you'll want to change the data type of score from TEXT to a numeric (int/bigint) to avoid ASCII-betical sorting, like this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
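The difference between the two orderings is easy to reproduce outside Cassandra. Here is a minimal sketch in plain Python using the three scores from the sample data; text values compare character by character, so '9' sorts above '87':

```python
scores_as_text = ["87", "9", "91"]
scores_as_int = [int(s) for s in scores_as_text]

# Descending order on text compares one character at a time.
print(sorted(scores_as_text, reverse=True))  # ['91', '9', '87']

# Descending order on integers compares the numeric values.
print(sorted(scores_as_int, reverse=True))   # [91, 87, 9]
```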
It might also help to read through the DataStax doc on Compound Keys and Clustering.

Paging Resultsets in Cassandra with compound primary keys - Missing out on rows

So, my original problem was using the token() function to page through a large data set in Cassandra 1.2.9, as explained and answered here: Paging large resultsets in Cassandra with CQL3 with varchar keys
The accepted answer got the select working with tokens and chunk size, but another problem manifested itself.
My table looks like this in cqlsh:
key | column1 | value
---------------+-----------------------+-------
85.166.4.140 | county_finnmark | 4
85.166.4.140 | county_id_20020 | 4
85.166.4.140 | municipality_alta | 2
85.166.4.140 | municipality_id_20441 | 2
93.89.124.241 | county_hedmark | 24
93.89.124.241 | county_id_20005 | 24
The primary key is a composite of key and column1. In CLI, the same data looks like this:
get ip['85.166.4.140'];
=> (counter=county_finnmark, value=4)
=> (counter=county_id_20020, value=4)
=> (counter=municipality_alta, value=2)
=> (counter=municipality_id_20441, value=2)
Returned 4 results.
The problem
When using cql with a limit of e.g. 100, the returned results may stop in the middle of a record, like this:
key | column1 | value
---------------+-----------------------+-------
85.166.4.140 | county_finnmark | 4
85.166.4.140 | county_id_20020 | 4
leaving these two "rows" (columns) out:
85.166.4.140 | municipality_alta | 2
85.166.4.140 | municipality_id_20441 | 2
Now, when I use the token() function for the next page like this, these two rows are skipped:
select * from ip where token(key) > token('85.166.4.140') limit 10;
Result:
key | column1 | value
---------------+------------------------+-------
93.89.124.241 | county_hedmark | 24
93.89.124.241 | county_id_20005 | 24
95.169.53.204 | county_id_20006 | 2
95.169.53.204 | county_oppland | 2
So, no trace of the last two results from the previous IP address.
Question
How can I use token() for paging without skipping over cql rows? Something like:
select * from ip where token(key) > token(key:column1) limit 10;
Ok, so I used the info in this post to work out a solution:
http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive
(section "CQL3 pagination").
First, I execute this cql:
select * from ip limit 5000;
From the last row in the resultset, I get the key (e.g. '85.166.4.140') and the value from column1 (e.g. 'county_id_20020').
Then I create a prepared statement evaluating to
select * from ip where token(key) = token('85.166.4.140') and column1 > 'county_id_20020' ALLOW FILTERING;
(I'm guessing it would work also without using the token() function, as the check is now for equal:)
select * from ip where key = '85.166.4.140' and column1 > 'county_id_20020' ALLOW FILTERING;
The resultset now contains the remaining X rows (columns) for this IP. The method then returns all the rows, and the next call to the method includes the last used key ('85.166.4.140'). With this key, I can execute the following select:
select * from ip where token(key) > token('85.166.4.140') limit 5000;
which gives me the next 5000 rows from (and including) the first IP after '85.166.4.140'.
Now, no columns are lost in the paging.
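The whole scheme can be sketched as an in-memory simulation in plain Python. No Cassandra connection is used here: ROWS mimics the table from the question, TOKEN is a made-up stand-in for token(key) that simply records partition order, and fetch_page and fetch_all are hypothetical helper names. Each list comprehension is annotated with the select it stands for:

```python
# Rows in partition (token) order, then column1 order, as in the question.
ROWS = [
    ("85.166.4.140", "county_finnmark", 4),
    ("85.166.4.140", "county_id_20020", 4),
    ("85.166.4.140", "municipality_alta", 2),
    ("85.166.4.140", "municipality_id_20441", 2),
    ("93.89.124.241", "county_hedmark", 24),
    ("93.89.124.241", "county_id_20005", 24),
    ("95.169.53.204", "county_id_20006", 2),
    ("95.169.53.204", "county_oppland", 2),
]

# Stand-in for token(key): the position of each partition in ROWS.
TOKEN = {}
for key, _, _ in ROWS:
    TOKEN.setdefault(key, len(TOKEN))

def fetch_page(last_key, limit):
    """Return up to `limit` rows plus the remainder of the last partition."""
    if last_key is None:
        # select * from ip limit N
        page = ROWS[:limit]
    else:
        # select * from ip where token(key) > token(last_key) limit N
        page = [r for r in ROWS if TOKEN[r[0]] > TOKEN[last_key]][:limit]
    if not page:
        return page
    key, column1, _ = page[-1]
    # select * from ip where key = <key> and column1 > <column1>
    page += [r for r in ROWS if r[0] == key and r[1] > column1]
    return page

def fetch_all(limit):
    """Page through every row, remembering only the last key between calls."""
    rows, last_key = [], None
    while True:
        page = fetch_page(last_key, limit)
        if not page:
            return rows
        rows.extend(page)
        last_key = page[-1][0]

print(len(fetch_all(3)))  # 8
```

With a limit of 3 the first page would stop in the middle of 85.166.4.140, but the follow-up step picks up the remaining row before moving on to the next token.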
UPDATE
Cassandra 2.0 introduced automatic paging, handled by the client.
More info here: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
(note that setFetchSize is optional and not necessary for paging to work)

Is there a way to make clustering order by data type and not string in Cassandra?

I created a table in CQL3 in the cqlsh using the following CQL:
CREATE TABLE test (
locationid int,
pulseid int,
name text, PRIMARY KEY(locationid, pulseid)
) WITH CLUSTERING ORDER BY (locationid ASC, pulseid DESC);
Note that locationid is an integer.
However, after I inserted data, and ran a select, I noticed that locationid's ascending sort seems to be based upon string, and not integer.
cqlsh:citypulse> select * from test;
locationid | pulseid | name
------------+---------+------
0 | 3 | test
0 | 2 | test
0 | 1 | test
0 | 0 | test
10 | 3 | test
5 | 3 | test
Note the 0 10 5. Is there a way to make it sort via its actual data type?
Thanks,
Allison
In Cassandra, the first part of the primary key is the 'partition key'. That key is used to distribute data around the cluster. It does this in a random fashion to achieve an even distribution. This means that you cannot order by the first part of your primary key.
What version of Cassandra are you on? In the most recent version of 1.2 (1.2.2), the create statement you have used as an example is invalid.
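To see why 0, 10, 5 is not a string sort, here is a toy illustration in plain Python of ordering by a hash of the key instead of by the key itself. The toy_token function is made up: MD5 is used only because it ships with the standard library, and the hash your cluster applies depends on the configured partitioner, so the exact order printed here will not match your output:

```python
import hashlib

def toy_token(locationid):
    """Toy stand-in for a partitioner: hash the 4-byte big-endian int."""
    raw = locationid.to_bytes(4, "big", signed=True)
    return int.from_bytes(hashlib.md5(raw).digest(), "big")

location_ids = [0, 5, 10]

# Partitions come back in token order, which has no relation to numeric order.
by_token = sorted(location_ids, key=toy_token)
print(by_token)
```

Within a single locationid, rows are still ordered by the clustering column pulseid, which is why the pulseid values for locationid 0 come back in descending order.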

Resources