Cassandra CQL alternative to OR in WHERE clause - cassandra

Here's the code I used to create the table:
CREATE TABLE test.packages (
packageuuid timeuuid,
ruserid text,
suserid text,
timestamp int,
PRIMARY KEY (ruserid, suserid, packageuuid, timestamp)
);
and then I create a materialized view:
CREATE MATERIALIZED VIEW test.packages_by_userid
AS SELECT * FROM test.packages
WHERE ruserid IS NOT NULL
AND suserid IS NOT NULL
AND TIMESTAMP IS NOT NULL
AND packageuuid IS NOT NULL
PRIMARY KEY (ruserid, suserid, timestamp, packageuuid)
WITH CLUSTERING ORDER BY (packageuuid DESC);
I want to be able to search for packages sent between two IDs
so I would need something like this:
SELECT * FROM test.packages_by_userid WHERE ((ruserid = '1' AND suserid = '2') OR (ruserid = '2' AND suserid = '1')) AND timestamp > 1496601553;
How would I accomplish something like this with CQL?
I've searched a bit but I can't figure it out.
I'm willing to change the structure of the table if it will make something like this possible.
If it's doable without a materialized view that would also be good.

Use an IN clause:
SELECT * FROM test.packages_by_userid WHERE ruserid IN ( '1', '2') AND suserid IN ( '1','2') AND timestamp > 1496601553;
Note: keep the IN list small. A large IN clause spanning many partitions can cause GC pauses and heap pressure, which leads to slower overall performance. Also be aware that this query matches every combination of the listed values, including rows where ruserid and suserid are equal.
In practical terms, you are waiting on a single coordinator node to give you a response. It keeps all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.
If the multi-partition IN list is large, run a separate query for each partition (ruserid) with executeAsync instead:
SELECT * FROM test.packages_by_userid WHERE ruserid = '1' AND suserid IN ( '1','2') AND timestamp > 1496601553;
SELECT * FROM test.packages_by_userid WHERE ruserid = '2' AND suserid IN ( '1','2') AND timestamp > 1496601553;
Learn more: https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

Since you always search for both sender and receiver, I'd model this with the following table layout:
CREATE TABLE test.packages (
ruserid text,
suserid text,
timestamp int,
packageuuid timeuuid,
PRIMARY KEY ((ruserid, suserid), timestamp)
);
In this way, for each pair of sender/receiver you need to run two queries, one for each partition:
SELECT * FROM packages WHERE ruserid='1' AND suserid='2' AND timestamp > 1496601553;
SELECT * FROM packages WHERE ruserid='2' AND suserid='1' AND timestamp > 1496601553;
This is IMHO the best solution because, remember, in Cassandra you start from your queries and build your table models on that, never the reverse.
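A sketch of how the two-query pattern might look in practice. The table name packages_by_pair is hypothetical, and adding packageuuid as a second clustering column is an assumption on my part: with timestamp as the only clustering column, two packages between the same pair that share a timestamp would overwrite each other.

```cql
-- Hypothetical variant: one partition per (receiver, sender) pair,
-- newest packages first, packageuuid keeps rows with equal timestamps apart.
CREATE TABLE test.packages_by_pair (
    ruserid text,
    suserid text,
    timestamp int,
    packageuuid timeuuid,
    PRIMARY KEY ((ruserid, suserid), timestamp, packageuuid)
) WITH CLUSTERING ORDER BY (timestamp DESC, packageuuid DESC);

-- One query per direction; the user ids are text, so they are quoted.
SELECT * FROM test.packages_by_pair
WHERE ruserid = '1' AND suserid = '2' AND timestamp > 1496601553;

SELECT * FROM test.packages_by_pair
WHERE ruserid = '2' AND suserid = '1' AND timestamp > 1496601553;
```

The two result sets are merged in the application. Each query touches exactly one partition, which is the point of the design.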

Related

How to make a sequence of select, update and insert atomic in one single Cassandra statement?

I'm dealing with 1 million tweets (arriving at a rate of about 5K per second) and I would like to do something similar to this code in Cassandra. Let's say that I'm using a Lambda Architecture.
I know the following code does not work; I just want to use it to explain my logic.
DROP TABLE IF EXISTS hashtag_trend_by_week;
CREATE TABLE hashtag_trend_by_week(
shard_week timestamp,
hashtag text ,
counter counter,
PRIMARY KEY ( ( shard_week ), hashtag )
) ;
DROP TABLE IF EXISTS topten_hashtag_by_week;
CREATE TABLE topten_hashtag_by_week(
shard_week timestamp,
counter bigInt,
hashtag text ,
PRIMARY KEY ( ( shard_week ), counter, hashtag )
) WITH CLUSTERING ORDER BY ( counter DESC );
BEGIN BATCH
UPDATE hashtag_trend_by_week SET counter = counter + 22 WHERE shard_week='2021-06-15 12:00:00' and hashtag ='Gino';
INSERT INTO topten_hashtag_by_week( shard_week, hashtag, counter) VALUES ('2021-06-15 12:00:00','Gino',
SELECT counter FROM hashtag_trend_by_week WHERE shard_week='2021-06-15 12:00:00' AND hashtag='Gino'
) USING TTL 7200;
APPLY BATCH;
Then the final query to satisfy my UI should be something like
SELECT hashtag, counter FROM topten_hashtag_by_week WHERE shard_week='2021-06-15 12:00:00' limit 10;
Any suggestions?
You can only have CQL counter columns in a counter table so you need to rethink the schema for the hashtag_trend_by_week table.
Batch statements are used for making writes atomic in Cassandra so including a SELECT statement does not make sense.
The final query for topten_hashtag_by_week looks fine to me. Cheers!
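One way to read the points above, as a sketch rather than a definitive design: run the counter update on its own (counter and non-counter writes cannot share a batch), read the new value back in the application, and write it to the top-ten table as a separate statement. This assumes hashtag_trend_by_week is accepted as a counter table, and the literal 22 in the INSERT is only a placeholder for whatever the SELECT returned.

```cql
-- 1. Counter update, run on its own rather than inside a mixed batch.
UPDATE hashtag_trend_by_week SET counter = counter + 22
WHERE shard_week = '2021-06-15 12:00:00' AND hashtag = 'Gino';

-- 2. Read the current count back in the application.
SELECT counter FROM hashtag_trend_by_week
WHERE shard_week = '2021-06-15 12:00:00' AND hashtag = 'Gino';

-- 3. Write the value returned by step 2 (22 here is a placeholder).
INSERT INTO topten_hashtag_by_week (shard_week, counter, hashtag)
VALUES ('2021-06-15 12:00:00', 22, 'Gino') USING TTL 7200;
```

These three steps are not atomic. Because counter is a clustering column in topten_hashtag_by_week, every new count creates a new row, and the row holding the previous count stays until its TTL expires or it is deleted.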

Run query on specific partition of partitioned MySQL table

I would like to run my Ecto.Query.from on a specific partition of a partitioned MySQL table.
Example table:
CREATE TABLE `dogs` (
`dog_id` bigint(20) unsigned NOT NULL,
...
PRIMARY KEY (`dog_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (dog_id)
PARTITIONS 10 */
The ideal query for what I would like to achieve:
from(i in dogs, select: i.dog_id, partition: "p1")
The above doesn't work, of course, so I achieved this by converting the query to a string with Ecto.Adapters.SQL.to_sql and editing it:
... <> "PARTITION (#{partition}) AS" <> ...
This feels hacky and might break with future versions.
Is there a way to achieve this with Ecto?

CQL table design for temporal data

As a Cassandra novice, I have a CQL design question. I want to reuse a concept that I've built before with RDBMS systems to keep a history of customerData. The customer himself will only see the latest version, so that should be the fastest query, but queries on the whole history must also be possible.
My suggested entity properties:
customerId text,
validFromDate date,
validUntilDate date,
customerData text
The first save of customerData just INSERTs customerData with validFromDate=NOW and validUntilDate=31-12-9999
Subsequent saves of customerData change the last record (setting validUntilDate=NOW) and INSERT the new customerData with validFromDate=NOW and validUntilDate=31-12-9999
Result:
This way a query of (customerId, validUntilDate)=(id,31-12-9999) will give last saved version.
Query on (customerId) will give all history.
To query customerData at certain time t just use query with validFromDate < t < validUntilDate
My guess is PARTITION_KEY = customerId and CLUSTER_KEY can be validFromDate. Or use PRIMARY KEY = customerId. Or I could create two tables, one for fast querying of the latest version (with no history), and another for historical analyses.
How would you design this the CQL way? I think I'm thinking too much in RDBMS terms.
Use the change timestamp as a clustering key with DESC order, e.g.:
CREATE TABLE customer_data_versions (
id text,
change_time timestamp,
name text,
PRIMARY KEY (id, change_time)
) WITH CLUSTERING ORDER BY ( change_time DESC );
It will allow you to store data versions per customer id in descending order.
Insert two versions for the same id:
INSERT INTO customer_data_versions (id, change_time, name) VALUES ('id1', totimestamp(now()),'John');
INSERT INTO customer_data_versions (id, change_time, name) VALUES ('id1', totimestamp(now()),'John Doe');
Get last saved version:
SELECT * FROM customer_data_versions WHERE id='id1' LIMIT 1;
Get all versions for the id:
SELECT * FROM customer_data_versions WHERE id='id1';
Get versions between dates:
SELECT * FROM customer_data_versions WHERE id='id1' AND change_time <= before_date AND change_time >= after_date;
Please note that there are limits on partition size, which bound how many versions you can store per customer id:
Cells in a partition: ~2 billion (2^31); single column value size: 2 GB (1 MB is recommended)
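The question also asks for the version that was valid at a given time t. With the descending clustering order above, that can be sketched as follows; the timestamp literal is an arbitrary example value.

```cql
-- Newest version whose change_time is at or before t.
-- Rows are stored newest-first, so LIMIT 1 picks the one valid at t.
SELECT * FROM customer_data_versions
WHERE id = 'id1' AND change_time <= '2021-06-15 12:00:00'
LIMIT 1;
```

This makes a separate validUntilDate column unnecessary: a version is valid until the next change_time for the same id.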

nested map in cassandra data modelling

I have the following requirement for my dataset and need to understand which datatype I should use and how to save my data accordingly:
CREATE TABLE events (
id text,
evntoverlap map<text, map<timestamp,int>>,
PRIMARY KEY (id)
)
evntoverlap = {
'Dig1': {{'2017-10-09 04:10:05', 0}},
'Dig2': {{'2017-10-09 04:11:05', 0},{'2017-10-09 04:15:05', 0}},
'Dig3': {{'2017-10-09 04:11:05', 0},{'2017-10-09 04:15:05', 0},{'2017-10-09 04:11:05', 0}}
}
This gives an error:
Error from server: code=2200 [Invalid query] message="Non-frozen collections are not allowed inside collections: map<text, map<timestamp, int>>"
How should I store this type of data in a single column? Please suggest a datatype and an insert command for it.
Thanks,
This is a limitation of Cassandra: you can't nest a collection (or UDT) inside a collection without making it frozen. So you need to freeze one of the collections, either the nested one:
CREATE TABLE events (
id text,
evntoverlap map<text, frozen<map<timestamp,int>>>,
PRIMARY KEY (id)
);
or top-level:
CREATE TABLE events (
id text,
evntoverlap frozen<map<text, map<timestamp,int>>>,
PRIMARY KEY (id)
);
See documentation for more details.
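Since the question also asks for an insert command, here is a minimal sketch against the first variant (frozen inner map). The id 'e1' is a hypothetical value. Map literals use key: value pairs, and map keys are unique, so the repeated timestamp under 'Dig3' in the question would collapse to a single entry.

```cql
-- Assumes: evntoverlap map<text, frozen<map<timestamp,int>>>
INSERT INTO events (id, evntoverlap) VALUES (
    'e1',
    {
        'Dig1': {'2017-10-09 04:10:05': 0},
        'Dig2': {'2017-10-09 04:11:05': 0, '2017-10-09 04:15:05': 0}
    }
);

-- Setting one outer key replaces that entire inner (frozen) map.
UPDATE events SET evntoverlap['Dig1'] = {'2017-10-09 04:10:05': 1}
WHERE id = 'e1';
```

With the inner map frozen, individual timestamps inside 'Dig1' cannot be updated in place; the whole inner map is written each time.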
CQL collections are limited to 64 KB, so if you put things like maps inside maps you might hit that limit. With frozen maps in particular, you deserialize the entire map, modify it, and reinsert it. You might be better off with:
CREATE TABLE events (
id text,
evnt_key text,
value map<timestamp, int>,
PRIMARY KEY ((id), evnt_key)
);
Or even:
CREATE TABLE events (
id text,
evnt_key text,
evnt_time timestamp,
value int,
PRIMARY KEY ((id), evnt_key, evnt_time)
);
It would be more efficient and safer, with additional benefits such as being able to order the evnt_time values in ascending or descending order.
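A short sketch of how the fully flattened table (one row per id, evnt_key and evnt_time) might be written and read. The id 'e1' and the time window are hypothetical values chosen for illustration.

```cql
-- One row per (id, evnt_key, evnt_time).
INSERT INTO events (id, evnt_key, evnt_time, value)
VALUES ('e1', 'Dig2', '2017-10-09 04:11:05', 0);
INSERT INTO events (id, evnt_key, evnt_time, value)
VALUES ('e1', 'Dig2', '2017-10-09 04:15:05', 0);

-- All entries for one key within a time window.
SELECT evnt_time, value FROM events
WHERE id = 'e1' AND evnt_key = 'Dig2'
  AND evnt_time >= '2017-10-09 04:00:00'
  AND evnt_time < '2017-10-09 05:00:00';
```

Each timestamp is now an ordinary row, so a single entry can be added, updated or deleted without rewriting a whole collection.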

Is using all fields as partition keys in a table a drawback in Cassandra?

My aim is to get the msgAddDate based on the query below:
select max(msgAddDate)
from sampletable
where reportid = 1 and objectType = 'loan' and msgProcessed = 1;
Design 1 :
Here, reportid, objectType and msgProcessed may not be unique. To make the key unique I have added msgAddDate and msgProcessedDate (an additional unique value).
I use this design because I don't perform range queries.
CREATE TABLE sampletable (
reportid INT,
objectType TEXT,
msgAddDate TIMESTAMP,
msgProcessed INT,
msgProcessedDate TIMESTAMP,
PRIMARY KEY ((reportid, msgProcessed, objectType, msgAddDate, msgProcessedDate))
);
Design 2 :
create table sampletable (
reportid INT,
objectType TEXT,
msgAddDate TIMESTAMP,
msgProcessed INT,
msgProcessedDate TIMESTAMP,
PRIMARY KEY ((reportid, msgProcessed, objectType), msgAddDate, msgProcessedDate)
);
Please advise which one to use and what the pros and cons of each are in terms of performance.
Design 2 is the one you want.
In Design 1, the whole primary key is the partition key. This means you need to provide all the attributes (reportid, msgProcessed, objectType, msgAddDate, msgProcessedDate) to query your data with a SELECT statement, which wouldn't be useful: you would not retrieve any attribute beyond the ones you already provided in the WHERE clause.
In Design 2, your partition key is (reportid, msgProcessed, objectType), which are the three attributes you want to query by. msgAddDate is the first clustering column, so rows are stored sorted by it within each partition. You don't even need to run max(), since the data is already sorted. All you need to do is use LIMIT 1:
SELECT msgAddDate FROM sampletable WHERE reportid = 1 and objectType = 'loan' and msgProcessed = 1 LIMIT 1;
Of course, make sure to define a DESC clustering order on msgAddDate (the default is ascending).
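Putting the pieces together, Design 2 with a descending clustering order might look like the sketch below. Ordering msgProcessedDate descending as well is my assumption; only msgAddDate needs to be DESC for the query to work.

```cql
CREATE TABLE sampletable (
    reportid INT,
    objectType TEXT,
    msgAddDate TIMESTAMP,
    msgProcessed INT,
    msgProcessedDate TIMESTAMP,
    PRIMARY KEY ((reportid, msgProcessed, objectType), msgAddDate, msgProcessedDate)
) WITH CLUSTERING ORDER BY (msgAddDate DESC, msgProcessedDate DESC);

-- The first row of the partition carries the latest msgAddDate.
SELECT msgAddDate FROM sampletable
WHERE reportid = 1 AND objectType = 'loan' AND msgProcessed = 1
LIMIT 1;
```

If the table already exists with the default ascending order, adding ORDER BY msgAddDate DESC before LIMIT 1 should give the same result, at the cost of a reversed read.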
Hope it helps!
