How to count the duplicate records in a table in Cassandra

Let's say my data is like below:

acct_id | amount
--------|-------
  10001 |  6.00
  20000 |  5.00
  32356 |  1.00
  10001 |  2.00
  45000 |  1.50
  45000 | 10.00

My expected result should be like this:

acct_id | count
--------|------
  10001 | 2
  45000 | 2

How do I get it in Cassandra?

If you're using Cassandra 2.2.x or 3.x, you can create a user-defined aggregate:
CREATE FUNCTION counByAccId(state map<int, int>, acctid int)
RETURNS NULL ON NULL INPUT
RETURNS map<int, int>
LANGUAGE java
AS '
    if (state.containsKey(acctid)) {
        Integer currentCount = (Integer) state.get(acctid);
        state.put(acctid, currentCount + 1);
    } else {
        state.put(acctid, 1);
    }
    return state;
';
CREATE AGGREGATE groupByAcctIdAndCount(int)
SFUNC counByAccId
STYPE map<int, int>
INITCOND {};
SELECT groupByAcctIdAndCount(acct_id) FROM myTable WHERE partition_key = xxx;
Example data set:
select * from agg;

 partition_key | acct_id | val
---------------+---------+------
             5 |   45000 |  1.5
             1 |   10001 |    6
             2 |   20000 |    5
             4 |   10001 |    2
             6 |   45000 |   10
             3 |   32356 |    1
select groupByAcctIdAndCount(acct_id) FROM agg;
music.groupbyacctidandcount(acct_id)
------------------------------------------
{10001: 2, 20000: 1, 32356: 1, 45000: 2}
WARNING: be sure to read my blog about UDAs and the implications in terms of performance when scanning a full table: http://www.doanduyhai.com/blog/?p=2015

Related

How can I filter for a specific date on a CQL timestamp column?

I have a table defined as:
CREATE TABLE downtime (
    asset_code text,
    down_start timestamp,
    down_end timestamp,
    down_duration duration,
    down_type text,
    down_reason text,
    PRIMARY KEY ((asset_code, down_start), down_end)
);
I'd like to get downtime on a particular day, such as:
SELECT * FROM downtime \
WHERE asset_code = 'CA-PU-03-LB' \
AND todate(down_start) = '2022-12-11';
I got a syntax error:
SyntaxException: line 1:66 no viable alternative at input '(' (...where asset_code = 'CA-PU-03-LB' and [todate](...)
If a function is not allowed on a partition key in a WHERE clause, how can I get data with a "down_start" on a particular day?
You don't need to use the TODATE() function to filter for a specific date. You can simply specify the date as '2022-12-11' when applying a filter on a CQL timestamp column.
The catch is that you cannot use the equality operator (=), because the CQL timestamp data type is encoded as the number of milliseconds since the Unix epoch (Jan 1, 1970 00:00 GMT); a bare date only matches rows whose timestamp is exactly midnight, so you need to be precise when working with timestamps.
Let me illustrate using this example table:
CREATE TABLE tstamps (
    id int,
    tstamp timestamp,
    colour text,
    PRIMARY KEY (id, tstamp)
);
My table contains the following sample data:
cqlsh> SELECT * FROM tstamps ;
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-05 11:25:01.000000+0000 | red
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
1 | 2022-12-07 01:48:07.870000+0000 | blue
1 | 2022-12-07 03:13:27.313000+0000 | indigo
The cqlsh client formats the tstamp column into a human-readable date in UTC. But really, the tstamp values are stored as integers:
cqlsh> SELECT tstamp, TOUNIXTIMESTAMP(tstamp) FROM tstamps ;
tstamp | system.tounixtimestamp(tstamp)
---------------------------------+--------------------------------
2022-12-05 11:25:01.000000+0000 | 1670239501000
2022-12-06 02:45:04.564000+0000 | 1670294704564
2022-12-06 11:06:48.119000+0000 | 1670324808119
2022-12-06 19:02:52.192000+0000 | 1670353372192
2022-12-07 01:48:07.870000+0000 | 1670377687870
2022-12-07 03:13:27.313000+0000 | 1670382807313
To retrieve the rows for a specific date, you need to specify the range of timestamps which fall on that date. For example, the timestamps for 6 Dec 2022 UTC range from 1670284800000 (2022-12-06 00:00:00.000 UTC) to 1670371199999 (2022-12-06 23:59:59.999 UTC).
This means if we want to query for December 6, we need to filter using a range query:
SELECT * FROM tstamps \
WHERE id = 1 \
AND tstamp >= '2022-12-06' \
AND tstamp < '2022-12-07';
and we get:
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
WARNING: In your case, where the timestamp column is part of the partition key, performing a range query is dangerous because it results in a multi-partition query: there are 86.4 million possible millisecond values between 1670284800000 and 1670371199999. For this reason, timestamps are not a good choice for partition keys. Cheers!
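If remodeling the table is an option, one common pattern is to bucket the partition key by day so that the query for a single date hits exactly one partition. This is a sketch, not from the original answer; the down_date column is an assumed addition that the application derives from down_start at write time:
CREATE TABLE downtime_by_day (
    asset_code text,
    down_date date,        -- assumed column, derived from down_start on write
    down_start timestamp,
    down_end timestamp,
    down_type text,
    down_reason text,
    PRIMARY KEY ((asset_code, down_date), down_start)
);

SELECT * FROM downtime_by_day
WHERE asset_code = 'CA-PU-03-LB'
  AND down_date = '2022-12-11';
With this model, the "downtime on a particular day" query is a single-partition read instead of a multi-partition range scan.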

Cassandra MAX function returning mismatched rows

Hi, I am trying to get the max co-author publication count from a table in Cassandra, however it returns mismatched rows when I query:
select coauthor_name, MAX(num_of_colab) AS max_2020 from coauthor_by_author where pid = '40/2499' and year = 2020;
It returns a row which is wrong, because the 9 belongs to another coauthor.
Here is my create statement for the table:
CREATE TABLE IF NOT EXISTS coauthor_by_author (
    pid text,
    year int,
    coauthor_name text,
    num_of_colab int,
    PRIMARY KEY ((pid), year, coauthor_name, num_of_colab)
) WITH CLUSTERING ORDER BY (year DESC);
As proof, here is part of the original table, where you can see that Abdul Hanif Bin Zaini's publication count as a coauthor should only be 1.
The MAX() function is working as advertised but I think your understanding of how it works is incorrect. Let me illustrate with an example.
Here is the schema for my table of authors:
CREATE TABLE authors_by_coauthor (
    author text,
    coauthor text,
    colabs int,
    PRIMARY KEY (author, coauthor)
);
Here is some sample data of authors, their corresponding co-authors, and the number of times they collaborated:
author | coauthor | colabs
---------+-----------+--------
edda | ramakanta | 5
edda | ruzica | 9
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
cassius | ceadda | 14
cassius | flaithri | 13
Anita has three co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'anita';
author | coauthor | colabs
--------+----------+--------
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
And the top number of collaborations for Anita is 12:
SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'anita';
system.max(colabs)
--------------------
12
Similarly, Cassius has two co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'cassius';
author | coauthor | colabs
---------+----------+--------
cassius | ceadda | 14
cassius | flaithri | 13
with 14 as the most collaborations:
cqlsh> SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'cassius';
system.max(colabs)
--------------------
14
Your question is incomplete since you haven't provided the full sample data, but I suspect you're expecting to get the name of the co-author with the most collaborations. This CQL query will NOT return the result you're after:
SELECT coauthor_name, MAX(num_of_colab)
FROM coauthor_by_author
WHERE ...
In SELECT coauthor_name, MAX(num_of_colab), you are incorrectly assuming that the returned coauthor_name corresponds to the row that holds MAX(num_of_colab). Aggregate functions only ever return ONE row, so the result set only ever contains one co-author. The co-author Abdul ... just happens to be the first row read, so it is listed alongside the MAX() output.
When using aggregate functions, it only makes sense to specify the function in the SELECT statement on its own:
SELECT function(col_name) FROM table WHERE ...
Specifying other columns in the query selectors is meaningless with aggregate functions. Cheers!
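If what you actually need is "the co-author with the most collaborations", the usual Cassandra approach is to maintain a second table clustered by the collaboration count, so the top co-author is simply the first row of the partition. A minimal sketch, assuming a hypothetical table name and that the application writes to both tables:
CREATE TABLE coauthors_by_colabs (
    pid text,
    year int,
    num_of_colab int,
    coauthor_name text,
    PRIMARY KEY ((pid, year), num_of_colab, coauthor_name)
) WITH CLUSTERING ORDER BY (num_of_colab DESC, coauthor_name ASC);

SELECT coauthor_name, num_of_colab
FROM coauthors_by_colabs
WHERE pid = '40/2499' AND year = 2020
LIMIT 1;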

Cassandra - How to get the Day of week from the timestamp column?

I have a timestamp column in a Cassandra table. How do I get the day of the week from the timestamp column using CQL?
There isn't out-of-the-box support for this, but if using CQL is a must, you can have a look at user-defined functions:
http://cassandra.apache.org/doc/latest/cql/functions.html
http://www.datastax.com/dev/blog/user-defined-functions-in-cassandra-3-0
http://docs.datastax.com/en//cql/latest/cql/cql_using/useCreateUDF.html
Then you could use something as simple as:
How to determine day of week by passing specific date?
or even something like:
Aggregation with Group By date in Spark SQL
And then you have a UDF that gives you the day of the week when you are working with dates.
Maybe this answer will be helpful for someone still looking for an answer in 2022.
You can create a user-defined function:
CREATE OR REPLACE FUNCTION DOW(
    input_date_string varchar,
    date_pattern varchar
)
CALLED ON NULL INPUT
RETURNS int
LANGUAGE java AS
'
    int ret = -1;
    try {
        ret = java.time.LocalDate.parse(input_date_string, java.time.format.DateTimeFormatter.ofPattern(date_pattern))
                  .getDayOfWeek()
                  .getValue();
    } catch (java.lang.Exception ex) {
        // parse error: do nothing here and -1 will be returned
    }
    return ret;
';
Test
cqlsh:store> create table testdate(key int PRIMARY KEY , date_string varchar );
... insert some date_strings ...
INSERT INTO testdate (key , date_string ) VALUES ( 9, '2022-11-22');
...
cqlsh:store> select date_string, dow(date_string, 'yyyy-MM-dd') from testdate;
date_string | store.dow(date_string, 'yyyy-MM-dd')
-------------+--------------------------------------
50/11/2022 | -1
2022-11-23 | 3
19/11/2024 | -1
2022-11-21 | 1
19/11/2023 | -1
19/11/20249 | -1
2022-11-20 | 7
50/aa/2022 | -1
2022-11-22 | 2
19/11/2024 | -1
Similar function with timestamp argument
CREATE OR REPLACE FUNCTION DOW_TS(
    input_date_time timestamp,
    zone_id varchar
)
CALLED ON NULL INPUT
RETURNS int
LANGUAGE java AS
'
    int ret = -1;
    try {
        ret = input_date_time.toInstant().atZone(java.time.ZoneId.of(zone_id)).toOffsetDateTime()
                  .getDayOfWeek()
                  .getValue();
    } catch (java.lang.Exception ex) {
        // invalid zone id or other error: -1 will be returned
    }
    return ret;
';
Test
cqlsh:store> select id, dt, dow_ts(dt, 'UTC'), dow_ts(dt,'WHAT') from testdt;
id | dt | store.dow_ts(dt, 'UTC') | store.dow_ts(dt, 'WHAT')
----+---------------------------------+-------------------------+--------------------------
1 | 2022-11-19 14:30:47.420000+0000 | 6 | -1
The functions above were tested with the following Cassandra setup:
INFO [main] 2022-11-19 12:25:47,004 CassandraDaemon.java:632 - JVM vendor/version: OpenJDK 64-Bit Server VM/11.0.17
INFO [main] 2022-11-19 12:25:50,737 StorageService.java:736 - Cassandra version: 4.0.7
INFO [main] 2022-11-19 12:25:50,738 StorageService.java:737 - CQL version: 3.4.5
References:
https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCreateUDF.html
https://cassandra.apache.org/_/quickstart.html
Hint: make sure "enable_user_defined_functions: true" is set in /etc/cassandra/cassandra.yaml.
With the Docker option above (https://cassandra.apache.org/_/quickstart.html), you can do a quick hack as below:
$ docker run --rm -d --name cassandra --hostname cassandra --network cassandra cassandra
$ docker cp cassandra:/etc/cassandra/cassandra.yaml .
Use your favorite editor to change "enable_user_defined_functions: false" to "enable_user_defined_functions: true" in "$(pwd)"/cassandra.yaml, then stop the running container (it was started with --rm, so stopping also removes it) and start a new one with the edited file mounted:
$ docker stop cassandra
$ docker run --rm -d --name cassandra --hostname cassandra --network cassandra --mount type=bind,source="$(pwd)"/cassandra.yaml,target=/etc/cassandra/cassandra.yaml cassandra
If you have a very old Cassandra version that does not support Java 8, then maybe the alternative below would work (see https://en.wikipedia.org/wiki/Determination_of_the_day_of_the_week):
CREATE OR REPLACE FUNCTION DOW_Tomohiko_Sakamoto(
    input_date_time timestamp
)
CALLED ON NULL INPUT
RETURNS int
LANGUAGE java AS
'
    // Tomohiko Sakamoto algorithm, using the deprecated java.util.Date accessors
    int y = input_date_time.getYear() + 1900;
    int m = input_date_time.getMonth() + 1;
    int d = input_date_time.getDate();
    int t[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
    if (m < 3) {
        y -= 1;
    }
    int ret = (y + y / 4 - y / 100 + y / 400 + t[m - 1] + d) % 7;
    if (ret == 0) {
        ret = 7; // map Sunday from 0 to 7 to match ISO numbering
    }
    return ret;
';
Test
cqlsh:store> insert into data(id, dt ) VALUES (2, '2022-11-19 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (3, '2022-11-21 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (4, '2022-11-23 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (5, '2022-11-24 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (7, '2022-11-25 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (8, '2022-11-26 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (9, '2022-11-27 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (10, '2022-11-28 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (11, '2020-02-29 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (12, '2020-02-30 00:00:00+0000');
cqlsh:store> insert into data(id, dt ) VALUES (13, '2020-02-31 00:00:00+0000');
cqlsh:store> select id, dt, dow_ts(dt,'UTC'), DOW_Tomohiko_Sakamoto(dt) from data;
id | dt | store.dow_ts(dt, 'UTC') | store.dow_tomohiko_sakamoto(dt)
----+---------------------------------+-------------------------+---------------------------------
5 | 2022-11-24 00:00:00.000000+0000 | 4 | 4
10 | 2022-11-28 00:00:00.000000+0000 | 1 | 1
13 | 2020-02-29 00:00:00.000000+0000 | 6 | 6
11 | 2020-02-29 00:00:00.000000+0000 | 6 | 6
1 | 2022-11-20 17:43:28.568000+0000 | 7 | 7
8 | 2022-11-26 00:00:00.000000+0000 | 6 | 6
2 | 2022-11-19 00:00:00.000000+0000 | 6 | 6
4 | 2022-11-23 00:00:00.000000+0000 | 3 | 3
7 | 2022-11-25 00:00:00.000000+0000 | 5 | 5
9 | 2022-11-27 00:00:00.000000+0000 | 7 | 7
12 | 2020-02-29 00:00:00.000000+0000 | 6 | 6
3 | 2022-11-21 00:00:00.000000+0000 | 1 | 1

Cassandra how to add values in a single row on every hit

The application will feed this table with the data below, and updates will arrive incrementally as and when we receive status changes. So initially the table will look like the one shown below:
+---------------+---------------+---------------+---------------+
| ID | Total count | Failed count | Success count |
+---------------+---------------+---------------+---------------+
| 1 | 30 | 10 | 20 |
+---------------+---------------+---------------+---------------+
Now let's assume a total of 30 messages were pushed, out of which 10 failed and 20 succeeded, as shown above. Then the application runs again and the values change: 20 new records come in, all of which are successes. This should be updated in the same row:
+---------------+---------------+---------------+---------------+
| ID | Total count | Failed count | Success count |
+---------------+---------------+---------------+---------------+
| 1 | 50 | 10 | 40 |
+---------------+---------------+---------------+---------------+
Is it feasible in Cassandra DB using Counter data type?
Of course you can use counter tables in your case.
Let's assume a table structure like:
CREATE KEYSPACE Test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

CREATE TABLE data (
    id int,
    data text,
    PRIMARY KEY (id)
);

CREATE TABLE counters (
    id int,
    total_count counter,
    failed_count counter,
    success_count counter,
    PRIMARY KEY (id)
);
You can increment the counters by running queries like:
UPDATE counters
SET total_count = total_count + 1,
    success_count = success_count + 1
WHERE id = 1;
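For the scenario in the question, each application run translates into one increment per counter. A sketch against the counters table above; note that counter columns can only be incremented or decremented, never set directly:
-- First run: 30 total, 10 failed, 20 successful
UPDATE counters
SET total_count = total_count + 30,
    failed_count = failed_count + 10,
    success_count = success_count + 20
WHERE id = 1;

-- Second run: 20 new messages, all successful
UPDATE counters
SET total_count = total_count + 20,
    success_count = success_count + 20
WHERE id = 1;

-- Read back the running totals: 50 | 10 | 40
SELECT total_count, failed_count, success_count FROM counters WHERE id = 1;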
Hope this can help you.

Spark: How to join RDDs by time range

I have a delicate Spark problem that I just can't wrap my head around.
We have two RDDs (coming from Cassandra). RDD1 contains Actions and RDD2 contains Historic data. Both have an id on which they can be matched/joined. But the problem is that the two tables have an N:N relationship: Actions contains multiple rows with the same id, and so does Historic. Here is some example data from both tables.
Actions (time is actually a timestamp):
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic (set_at is actually a timestamp):
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
How can we join these two tables in a way that we get a result like this:
1 | 100 # 500 - 400 for Actions#1 with time 12:05 because Historic was in that time at 400
1 | 50 # 500 - 450 for Actions#2 with time 12:30 because H. was in that time at 450
2 | 50 # 125 - 75 for Actions#3 with time 12:30 because H. was in that time at 75
I can't come up with a good solution that feels right without making a lot of iterations over huge datasets. I always have to think about making a range from the Historic set and then somehow checking whether the Action fits in the range, e.g. (11:00 - 12:15), to make the calculation. But that seems pretty slow to me. Is there a more efficient way to do that? It seems to me that this kind of problem could be popular, but I couldn't find any hints on it yet. How would you solve this problem in Spark?
My current attempt so far (in half-finished code):
case class Historic(id: String, set_at: Long, valueY: Int)
val historicRDD = sc.cassandraTable[Historic](...)
historicRDD
.map( row => ( row.id, row ) )
.reduceByKey(...)
// transforming to another case which results in something like this; code not finished yet
// (List((Range(0, 12:25), 400), (Range(12:25, NOW), 450)))
// From here we could join with Actions
// And then some .filter maybe to select the right Lists tuple
It's an interesting problem. I also spent some time figuring out an approach. This is what I came up with:
Given case classes Action(id, time, x) and Historic(id, time, y):
Join the actions with the history (this might be heavy).
Filter out all historic data that is not relevant for a given action.
Key the results by (id, time) to differentiate the same key at different times.
Reduce the history per action to the record with the latest timestamp, leaving us with the relevant historical record for each action.
In Spark:
val actionById = actions.keyBy(_.id)
val historyById = historic.keyBy(_.id)
val actionByHistory = actionById.join(historyById)
val filteredActionByIdTime = actionByHistory.collect {
  case (k, (action, historic)) if action.time > historic.time =>
    ((action.id, action.time), (action, historic))
}
val topHistoricByAction = filteredActionByIdTime.reduceByKey {
  case ((a1: Action, h1: Historic), (a2: Action, h2: Historic)) =>
    (a1, if (h1.time > h2.time) h1 else h2)
}
// we are done, let's produce a report now
val report = topHistoricByAction.map {
  case ((id, time), (action, historic)) => (id, time, action.x - historic.y)
}
Using the data provided above, the report looks like:
report.collect
Array[(Int, Long, Int)] = Array((1,43500,100), (1,45000,50), (2,45000,50))
(I transformed the time to seconds to have a simplistic timestamp)
After a few hours of thinking, trying and failing, I came up with this solution. I am not sure if it is any good, but for lack of other options, this is my solution.
First we expand our case class Historic:
case class Historic(id: String, set_at: Long, valueY: Int) {
  // java.util.TreeMap gives us the floorEntry lookup that the standard
  // Scala collections don't provide out of the box
  val set_at_map = new java.util.TreeMap[Long, Int]()
  set_at_map.put(0, valueY)      // means from the beginning of the epoch ...
  set_at_map.put(set_at, valueY) // ... up to the set_at date

  // This is the fun part. With getHistoricValue we can pass any timestamp and
  // get back the value of the greatest key at or before that date. For more
  // information look at this answer: http://stackoverflow.com/a/13400317/1209327
  def getHistoricValue(date: Long): Option[Int] = {
    val e = set_at_map.floorEntry(date)
    if (e == null) None else Some(e.getValue)
  }
}
The case class is ready and now we bring it into action
val historicRDD = sc.cassandraTable[Historic](...)
  .map(row => (row.id, row))
  .reduceByKey((row1, row2) => {
    row1.set_at_map.put(row2.set_at, row2.valueY) // merge the historic events per id
    row1
  })

// Now we load the Actions and key them by id, as we did with Historic
val actionsRDD = sc.cassandraTable[Actions](...)
  .map(row => (row.id, row))

// Both RDDs now have the same key, so we can join them
val fin = actionsRDD.join(historicRDD)
  .map { case (id, (action, historic)) =>
    (id, action.valueX - historic.getHistoricValue(action.time).get) // valueY at that timestamp
  }
I am totally new to Scala, so please let me know if this code can be improved in some places.
I know that this question has already been answered, but I want to add another solution that worked for me:
your data -
Actions
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
Union Actions and Historic
Combined
id | time | valueX | record-type
1 | 12:05 | 500 | Action
1 | 12:30 | 500 | Action
2 | 12:30 | 125 | Action
1 | 11:00 | 400 | Historic
1 | 12:15 | 450 | Historic
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id, but sort by time.
Partition-1
1 | 11:00 | 400 | Historic
1 | 12:05 | 500 | Action
1 | 12:15 | 450 | Historic
1 | 12:30 | 500 | Action
Partition-2
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
2 | 12:30 | 125 | Action
Traverse the records in each partition, in order (see the sketch at the end of this answer):
If it is a Historic record, add its valueY to a map keyed by id, or update the entry if the id is already present, so the map always tracks the latest valueY per id within the partition.
If it is an Action record, look up the id's valueY in the map and subtract it from valueX.
Using a map M per partition:
Partition-1 traversal in order
M = {1 -> 400} // a new entry in map M
1 | 100        // M(1) = 400; 500 - 400
M = {1 -> 450} // update M, because the key already exists
1 | 50         // 500 - M(1) = 500 - 450
Partition-2 traversal in order
M = {2 -> 50}  // a new entry in M
M = {2 -> 75}  // update M, because the key already exists
2 | 50         // M(2) = 75; 125 - 75
You could instead partition and sort by time, but then you would need to merge the partitions later, and that could add some complexity.
I found this approach preferable to the many-to-many join that we usually get when joining on time ranges.
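A minimal sketch of this approach with toy in-memory data; the record shape, partitioner, and all names are assumptions, and real code would build the combined RDD from the two Cassandra tables:
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Assumed combined record: Historic and Action rows unioned together
case class Rec(id: Int, time: Long, value: Int, isAction: Boolean)

// Partition by id only; the (id, time) key is what gets sorted
class IdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (id: Int, _) => ((id.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

object TimeJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("time-join").setMaster("local[*]"))
    val combined = sc.parallelize(Seq(
      Rec(1, 39600, 400, isAction = false), // Historic 11:00
      Rec(1, 43500, 500, isAction = true),  // Action   12:05
      Rec(1, 44100, 450, isAction = false), // Historic 12:15
      Rec(1, 45000, 500, isAction = true),  // Action   12:30
      Rec(2, 44400, 50,  isAction = false), // Historic 12:20
      Rec(2, 44700, 75,  isAction = false), // Historic 12:25
      Rec(2, 45000, 125, isAction = true)   // Action   12:30
    ))

    val report = combined
      .map(r => ((r.id, r.time), r))
      // same id lands in the same partition, sorted by (id, time) within it
      .repartitionAndSortWithinPartitions(new IdPartitioner(4))
      .mapPartitions { rows =>
        val latest = scala.collection.mutable.Map[Int, Int]() // latest valueY per id
        rows.flatMap { case ((id, _), r) =>
          if (!r.isAction) { latest(id) = r.value; Iterator.empty }
          else Iterator((id, r.value - latest.getOrElse(id, 0)))
        }
      }

    report.collect().foreach(println) // (1,100), (1,50), (2,50)
  }
}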
