Last record each group in cassandra - cassandra

I has a table with schema:
create table last_message_by_group
(
date date,
created_at timestamp,
message text,
group_id bigint,
primary key (date, created_at, message_id)
)
with clustering order by (created_at desc)
and data should be:
| date | created_at | message | group_id |
| 2021-05-11 | 7:23:54 | ddd | 1 |
| 2021-05-11 | 6:21:43 | ccc | 1 |
| 2021-05-11 | 5:35:16 | bbb | 2 |
| 2021-05-11 | 4:38:23 | aaa | 2 |
It will show messages order by created_at desc partition by date.
But the problem is it can not get last message each group likes:
| date | created_at | message | group_id |
| 2021-05-11 | 7:23:54 | ddd | 1 |
| 2021-05-11 | 5:35:16 | bbb | 2 |
created_at is cluster key, so it cant be updated, so I delete and insert new row every new message by group_id, this way make low performance
Is there any way to do that?

I was able to get this to work by making one change to your primary key definition. I added group_id as the first clustering key:
PRIMARY KEY (date, group_id, created_at, message_id)
After inserting the same data, this works:
> SELECT date, group_id, max(created_at), message
FROM last_message_by_group
WHERE date='2021-05-11'
GROUP BY date,group_id;
date | group_id | system.max(created_at) | message
------------+----------+---------------------------------+---------
2021-05-11 | 1 | 2021-05-11 12:23:54.000000+0000 | ddd
2021-05-11 | 2 | 2021-05-11 10:35:16.000000+0000 | bbb
(2 rows)
There's more detail on using CQL's GROUP BY clause in the official docs.
there is one problem, because you changed clustering key, so message will be ordered by group_id first. Any idea for still order by created_at and 1 message each group?
From the document linked above:
the GROUP BY option only accept as arguments primary key column names in the primary key order.
Unfortunately, if we were to adjust the primary key definition to put created_at before group_id, we would also have to group by created_at. That would create a "group" for each unique created_at, which negates the idea behind group_id.
In this case, you may have to decide between having the grouped results in a particular order vs. having them grouped at all. It might also be possible to group the results, but then re-order them appropriately on the application side.

Related

Cassnadra - Update/Delete based on timestamp datatype

I have below table structure which houses failed records.
CREATE TABLE if not exists dummy_plan (
id uuid,
payload varchar,
status varchar,
bucket text,
create_date timestamp,
modified_date timestamp,
primary key ((bucket), create_date, id))
WITH CLUSTERING ORDER BY (create_date ASC)
AND COMPACTION = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 1};
My table looks like below
| id | payload | status | bucket | create_date | modified_date |
| abc| text1 | Start | 2021-02-15 | 2021-02-15 08:07:50+0000 | |
Table and records are created and inserted successfully. However after processing, we want to update (if failed) and delete (if successful) record based on Id.
But am facing problem with timestamp where I tried giving same value but it still doesn't deletes/updates.
Seems Cassandra doesn't works with EQ with timestamp.
Please guide.
Thank you in advance.
Cassandra works just fine with the timestamp columns - you can use equality operation on that. But you need to make sure that you include milliseconds into the value, otherwise it won't match:
cqlsh> insert into test.dummy_service_plan_contract (id, create_date, bucket)
values (1, '2021-02-15T11:00:00.123Z', '123');
cqlsh> select * from test.dummy_service_plan_contract;
bucket | create_date | id | modified_date | payload | status
--------+---------------------------------+----+---------------+---------+--------
123 | 2021-02-15 11:00:00.123000+0000 | 1 | null | null | null
(1 rows)
cqlsh> delete from test.dummy_service_plan_contract where bucket = '123' and
id = 1 and create_date = '2021-02-15T11:00:00Z';
cqlsh> select * from test.dummy_service_plan_contract;
bucket | create_date | id | modified_date | payload | status
--------+---------------------------------+----+---------------+---------+--------
123 | 2021-02-15 11:00:00.123000+0000 | 1 | null | null | null
(1 rows)
cqlsh> delete from test.dummy_service_plan_contract where bucket = '123' and
id = 1 and create_date = '2021-02-15T11:00:00.123Z';
cqlsh> select * from test.dummy_service_plan_contract;
bucket | create_date | id | modified_date | payload | status
--------+-------------+----+---------------+---------+--------
(0 rows)
If you don't see the milliseconds in your output in the cqlsh, then you need to configure datetimeformat setting in the .cqlshrc

How to retrieve data from cassandra using IN Query with given order?

I'm selecting data from a Cassandra database using a query. It is working fine but how to get the data in same order as I have given IN query?
I have created table like this:
id | n | p | q
----+---+---+------
5 | 1 | 2 | 4
10 | 2 | 4 | 3
11 | 1 | 2 | null
I am trying to select data using
SELECT *
FROM malleshdmy
WHERE id IN ( 11,10,5)
But, It producing same data as like stored.
id | n | p | q
----+---+---+------
5 | 1 | 2 | 4
10 | 2 | 4 | 3
11 | 1 | 2 | null
Please help me in this issue.
I want data as 11,10 and 5
If the id is partition key, then it's impossible - data are sorted only inside the clustering columns, and data for different partition keys could be returned in arbitrary order (but sorted inside that partition).
You need to sort data yourself.
Since id is your partition key, your data is actually being sorted by the token of id, not the values themselves:
cqlsh:testid> SELECT id,n,p,q,token(id) FROM table;
id | n | p | q | system.token(id)
----+---+---+------+----------------------
5 | 1 | 2 | 4 | -7509452495886106294
10 | 2 | 4 | 3 | -6715243485458697746
11 | 1 | 2 | null | -4156302194539278891
Because of this, you don't have any control over how the partition key is sorted.
In order to sort your data by id, you need to make id a clustering column rather than a partition key. Your data will still need a partition key, however, and this will always be sorted by token.
If you decide to make id a clustering column, you will need to specify that you want a descending order in your order by statement
CREATE TABLE clusterTable (
... partition type, //partition key with a type to be specified
... id INT,
... n INT,
... p INT,
... q INT,
... PRIMARY KEY((partition),id))
... WITH CLUSTERING ORDER BY (id DESC);
This link is very helpful in discussing how ordering works in Cassandra: https://www.datastax.com/dev/blog/we-shall-have-order

Cassandra storage internal

I'm trying to understand what exactly happens internally in storage engine level when a row(columns) is inserted in a CQL style table.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
category text,
subcategory text,
itemid text,
count int,
price int,
PRIMARY KEY ((userid), time) - #1
PRIMARY KEY ((userid), time, category, subcategory, itemid, count, price) - #2
);
Suppose that I have a table like above.
In case of #1, a CQL row will generate 6(or 5?) columns in storage.
In case of #2, a CQL row will generate a very composite column in storage.
I'm wondering what's more effective way for storing logs into Cassandra.
Please focus on those given two situations.
I don't need any real-time reads. Just writings.
If you want to suggest other options please refer to the following.
The reasons I chose Cassandra for storing logs are
Linear scalability and good for heavy writing.
It has schema in CQL. I really prefer having a schema.
Seems to support Spark well enough. Datastax's cassandra-spark connector seems to have data locality awareness.
I'm trying to understand what exactly happens internally in storage engine level when a row(columns) is inserted in a CQL style table.
Let's say that I build tables with both of your PRIMARY KEYs, and INSERT some data:
aploetz#cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date1;
userid | time | dateof(time) | category | subcategory | itemid | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799
1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200
1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599
(5 rows)
aploetz#cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date2;
userid | time | dateof(time) | category | subcategory | itemid | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799
1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200
1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599
(5 rows)
Looks pretty much the same via cqlsh. So let's have a look from the cassandra-cli, and query all rows foor userid 1002:
RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:category, value=426f6f6b73, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:itemid, value=3637382d322d34343339382d3331322d39, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:price, value=0000031e, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:subcategory, value=4e6f76656c73, timestamp=1431092900008568)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:category, value=417564696f, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:itemid, value=3232382d352d34343334332d3334342d35, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:price, value=000012bf, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:subcategory, value=4865616470686f6e6573, timestamp=1431092985326774)
Simple enough, right? We see userid 1002 as the RowKey, and our clustering column of time as a column key. Following that, are all of our columns for each column key (time). And I believe your first instance generates 6 columns, as I'm pretty sure that includes the placeholder for the column key, because your PRIMARY KEY could point to an empty value (as your 2nd example key does).
But what about the 2nd version for userid 1002?
RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:Books:Novels:678-2-44398-312-9:1:798:, value=, timestamp=1431093011349994)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:Audio:Headphones:228-5-44343-344-5:1:4799:, value=, timestamp=1431093011360402)
Two columns are returned for RowKey 1002, one for each unique combination of our column (clustering) keys, with an empty value (as mentioned above).
So what does this all mean for you? Well, a few things:
This should tell you that PRIMARY KEYs in Cassandra ensure uniqueness. So if you decide that you need to update key values like category or subcategory (2nd example) that you really can't unless you DELETE and recreate the row. Although from a logging perspective, that's probably ok.
Cassandra stores all data for a particular partition/row key (userid) together, sorted by the column (clustering) keys. If you were concerned about querying and sorting your data, it would be important to understand that you would have to query for each specific userid for sort order to make any difference.
The biggest issue I see, is that right now you are setting yourself up for unbounded column growth. Partition/row keys can support a maximum of 2 billion columns, so your 2nd example will help you out the most there. If you think some of your userids might exceed that, you could implement a "date bucket" as an additional partition key (say, if you knew that a userid would never exceed more than 2 billion in a year, or whatever).
It looks to me like your 2nd option might be the better choice. But honestly for what you're doing, either of them will probably work ok.

Cassandra: low cardinality partition

Let's say I have a table, something like this:
CREATE TABLE Users (
user UUID,
seq INT,
group TEXT,
time BIGINT,
PRIMARY KEY ((user), seq)
);
This follows the desired pattern of Cassandra, with good distribution across partitions (assuming the default Murmur3 hash partitioner).
However, I also need to (rarely) perform range queries on and in time order. This doesn't seem possible in Cassandra. In reality I do need to access the data by group, so (group, time) is acceptable. Since there doesn't seem a way to have secondary index have multiple columns, I guess the right thing is to denormalize, into something like this:
CREATE TABLE UsersByGroupTime (
user UUID,
seq INT,
group TEXT,
time BIGINT,
PRIMARY KEY ((group), time)
) WITH CLUSTERING ORDER BY (time ASC);
This works entirely as it should, except that group is really low cardinality, let's say ('A','B','C'), and uneven distribution across users. Since queries on that table is rare, I'm not worried about hot nodes, but I am worried about uneven distribution, perhaps even a single node getting all.
Is this a common scenario and is there any way to mitigate this or are there alternative solutions?
One technique to help avoid hot-spots in Cassandra time series models, is in making use of a "time bucket." Essentially what you would do is determine the "happy medium" level of time precision that provides adequate data distribution, while also being known and semi-convenient to query by.
For the purposes of this example, I'll choose year and month ("yyyyMM"). Note: I have no idea if year and month will work for you...it's just an example. Once you determine your time bucket, you would add it as an additional partition key, like this:
CREATE TABLE UsersByGroupTime (
user UUID,
seq INT,
group TEXT,
time TIMEUUID,
yearmonth BIGINT,
PRIMARY KEY ((group, yearmonth), time)
) WITH CLUSTERING ORDER BY (time DESC);
After inserting some rows, queries like this will work:
aploetz#cqlsh:stackoverflow2> SELECT group, yearmonth, dateof(time), time, seq, user
FROM usersbygrouptime WHERE group='B' AND yearmonth=201505;
group | yearmonth | dateof(time) | time | seq | user
-------+-----------+--------------------------+--------------------------------------+-----+--------------------------------------
B | 201505 | 2015-05-16 10:04:10-0500 | ceda56f0-fbdc-11e4-bd43-21b264d4c94d | 1 | d57ba8a4-db24-440c-a983-b1dd6b0d2e27
B | 201505 | 2015-05-16 10:04:09-0500 | ce1cac40-fbdc-11e4-bd43-21b264d4c94d | 1 | 66d07cbb-a2ff-4d56-8fa1-14dfaf684474
B | 201505 | 2015-05-16 10:04:08-0500 | cd525760-fbdc-11e4-bd43-21b264d4c94d | 1 | 07b589ac-4d5f-401e-a34f-e3479e269e01
B | 201505 | 2015-05-16 10:04:06-0500 | cc76c470-fbdc-11e4-bd43-21b264d4c94d | 1 | 984f85b5-ea58-4cf8-b512-43abacb227c9
(4 rows)
Now that may or may not help you query-wise, so you will need to spend some time ensuring that you pick an appropriate time bucket. But, this does help in terms of data distribution in the ring, which you can see with the token function:
aploetz#cqlsh:stackoverflow2> SELECT group, yearmonth, token(group,yearmonth)
FROM usersbygrouptime ;
group | yearmonth | token(group, yearmonth)
-------+-----------+-------------------------
A | 201503 | -3784784210711042553
A | 201504 | -610775546464185720
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
(12 rows)
Notice how different tokens are generated for each group/yearmonth pair, even though some of them have the same group ("A").

Range query - Data modeling for time series in CQL Cassandra

I have a table like this:
CREATE TABLE test ( partitionkey text, rowkey text, date
timestamp, policyid text, policyname text, primary key
(partitionkey, rowkey));
with some data:
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
p2 | r3 | pl3 | plicy3 | 2008-01-03 00:00:00+0000
I want to be able to find:
1/ data from a particular partition key
2/ data from a particular partition key & rowkey
3/ Range query on date given a partitionkey
1/ and 2/ are trivial:
select * from test where partitionkey='p1';
partitionkey | rowkey | policyid | policyname | range
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
but what about 3/?
Even with an index it doesnt work:
create index i1 on test (date);
select * from test where partitionkey='p1' and date =
'2007-01-02';
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 plicy1 | 2007-01-02 00:00:00+0000
but
select * from test where partitionkey='p1' and
date > '2007-01-02';
Bad Request: No indexed columns present in
by-columns clause with Equal operator
Any idea?
thanks,
Matt
CREATE TABLE test ( partitionkey text, rowkey text, date timestamp,
policyid text, policyname text, primary key (partitionkey, rowkey));
First of all, you really should use more descriptive column names instead of partitionkey and rowkey (and even date, for that matter). By looking at those column names, I really can't tell what kind of data this table is supposed to be indexed by.
select * from test where partitionkey='p1' and date > '2007-01-02';
Bad Request: No indexed columns present in by-columns clause with Equal operator
As for this issue, try making your "date" column a part of your primary key.
primary key (partitionkey, rowkey, date)
Once you do that, I think your date range queries will function appropriately.
For more information on this, check out DataStax Academy's (free) course called Java Development With Apache Cassandra. Session 5, Module 104 discusses how to model time series data and that should help you out.

Resources