Cassandra chat table design - cassandra

For my chat table design in Cassandra I have the following schema:
USE zwoop_chat
CREATE TABLE IF NOT EXISTS public_messages (
chatRoomId text,
date timestamp,
fromUserId text,
fromUserNickName text,
message text,
PRIMARY KEY ((chatRoomId, fromUserId), date)
) WITH CLUSTERING ORDER BY (date ASC);
The following query:
SELECT * FROM public_messages WHERE chatroomid=? LIMIT 20
Results in the typical message:
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance. If you want to execute this query
despite the performance unpredictability, use ALLOW FILTERING;
Obviously I'm doing something wrong with the partitioning here.
I'm not experienced with Cassandra, and I'm a bit confused by online suggestions that this query would make Cassandra scan the entire table. Realistically, why would I ever want to fetch an entire table?
Another suggestion I read about is to partition by time, e.g. to fetch the latest messages per day. But that doesn't work for me: you don't know when the latest chat message occurred.
It could have been in the last hour, the last day, or last week or month for that matter.
I'm used to SQL, or NoSQL stores like Mongo, but this simple use case seems to be a problem for Cassandra. So what is the recommended approach here?
Edit:
It seems that it is common practice to add a bucket integer.
Let's say I create a bucket per 50 messages; is there a way to auto-increment the bucket when it is full?
I would prefer not to have to fetch the MAX bucket and calculate whether it is full on every write; that seems like bad performance for inserts.
Also it seems like a bad idea to manage the buckets in Java. Things like app restarts or load balancing would require extra logic.
(I currently use Java Spring JPA for Cassandra).
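For reference, the bucketing pattern usually puts a time-derived value into the partition key rather than a counter that has to be incremented. A minimal sketch, assuming a per-day text bucket that the application derives from the message timestamp (the table and bucket column names are only illustrative):
CREATE TABLE IF NOT EXISTS public_messages_bucketed (
chatRoomId text,
day text,              -- e.g. '2019-04-23', derived by the app from the message timestamp
date timestamp,
fromUserId text,
fromUserNickName text,
message text,
PRIMARY KEY ((chatRoomId, day), date)
) WITH CLUSTERING ORDER BY (date DESC);
Because the bucket is derived from the write time, nothing has to be fetched or incremented on insert; reads start at the current day and step back one bucket at a time until 20 messages have been collected.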

It works without bucketing using the following table design:
USE zwoop_chat
CREATE TABLE IF NOT EXISTS public_messages (
chatRoomId text,
date timestamp,
fromUserId text,
fromUserNickName text,
message text,
PRIMARY KEY ((chatRoomId), date)
) WITH CLUSTERING ORDER BY (date DESC);
I had to remove fromUserId from the partition key; I assume that if it stays in the partition key, it also has to appear in the WHERE clause to avoid the error.
The JPA query:
publicMessageRepository.findFirst20ByPkChatRoomIdOrderByPkDateDesc(chatRoomId);
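For reference, the CQL this derived query presumably translates to is along these lines (the clustering order is already DESC, so the ORDER BY only restates the stored order):
SELECT * FROM public_messages WHERE chatroomid = ? ORDER BY date DESC LIMIT 20;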

Related

How to search record using ORDER_BY without the partition keys

I'm debugging an issue, and the relevant logs should sit in a time range between 4/23/19 and 4/25/19.
There are hundreds of millions of records in our production system.
It's impossible to locate the target records when the results come back in effectively random order.
Is there any workaround to search within a time range without the partition key?
select * from XXXX.report_summary order by modified_at desc
Schema (column name, type, kind)
...
modified_at         TimestampType   regular
record_end_date     TimestampType   regular
record_entity_type  UTF8Type        clustering_key
record_frequency    UTF8Type        regular
record_id           UUIDType        partition_key
First, ORDER BY is really quite superfluous in Cassandra. It can only operate on your clustering columns within a partition, and then only in the exact order of the clustering columns. The reason for this is that Cassandra reads sequentially from disk, so it writes all data according to the defined clustering order to begin with.
So IMO, ORDER BY in Cassandra is pretty useless, except for cases where you want to change the sort direction (ascending/descending).
Secondly, due to its distributed nature, you need to take a query-oriented approach to data modeling. In other words, your tables must be designed to support the queries you intend to run. Now you can find ways around this, but then you're basically doing a full table scan on a distributed cluster, which won't end well for anyone.
Therefore, the recommended way to go about that, would be to build a table like this:
CREATE TABLE stackoverflow.report_summary_by_month (
record_id uuid,
record_entity_type text,
modified_at timestamp,
month_bucket bigint,
record_end_date timestamp,
record_frequency text,
PRIMARY KEY (month_bucket, modified_at, record_id)
) WITH CLUSTERING ORDER BY (modified_at DESC, record_id ASC);
Then, this query will work:
SELECT * FROM report_summary_by_month
WHERE month_bucket = 201904
AND modified_at >= '2019-04-23' AND modified_at < '2019-04-26';
The idea here is that since you care about the order of the results, you need to partition by something else for the sorting to work. For this example, I picked month, hence I've "bucketed" your results by month into a partition key called month_bucket. Within each month, I'm clustering on modified_at in DESCending order. This way, the most-recent results are at the "top" of the partition. Then, I threw in record_id as a tie-breaker key to help ensure uniqueness.
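To make the write side concrete, here is a minimal sketch of an insert into that table, assuming the application derives month_bucket (e.g. 201904) from modified_at before writing; the literal values are only illustrative:
INSERT INTO stackoverflow.report_summary_by_month
(month_bucket, modified_at, record_id, record_entity_type, record_end_date, record_frequency)
VALUES (201904, '2019-04-24 10:15:00', uuid(), 'example_type', '2019-04-30', 'daily');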
If you're still focused on doing this the wrong way:
You can actually run a range query on your current schema, but with "hundreds of millions of records" across several nodes, I don't have high hopes for that to work. Still, you can do it with the ALLOW FILTERING directive (which you shouldn't ever really use):
SELECT * FROM report_summary
WHERE modified_at >= '2019-04-23'
AND modified_at < '2019-04-26' ALLOW FILTERING;
This approach has the following caveats:
With many records across many nodes, it will likely time out.
Without being able to identify a single partition for this query, a coordinator node will be chosen, and that node has a high chance of becoming overloaded.
As this is pulling rows from multiple partitions, a sort order cannot be enforced.
ALLOW FILTERING makes Cassandra work in ways that it really wasn't designed to, so I would never use that on a production system.
If you really need to run a query like this, I recommend using an in-memory aggregation tool, like Spark.
Also, as the original question was about ORDER BY, I wrote an article a while back which better explains this topic: https://www.datastax.com/dev/blog/we-shall-have-order

Cassandra secondary vs extra table and read

I'm facing a dilemma that my small knowledge of Cassandra doesn't allow me to solve.
I have an index table used to retrieve data for an item (a notification) using an external id. However, the data contained in that table (in this case the status of the notification) gets modified, so I need to update the index table as well. Here is the table design:
CREATE TABLE notification_by_external_id (
external_id text,
partition_key_date text,
id uuid,
status text,
...
PRIMARY KEY (external_id, partition_key_date, id)
);
CREATE TABLE notification (
partition_key_date text,
status text,
id uuid,
...
PRIMARY KEY (partition_key_date, status, id)
);
The problem is that when I want to update the notification status (and hence the notification_by_external_id table), I don't have access to the external ID.
So far I have come up with 2 solutions, neither of which seems optimal, and I can't decide which one to go with.
Solution 1
Create an index on notification_by_external_id.id, but this will obviously be a high-cardinality column. There can be several external IDs for each notification, but we're talking about something around 5-10 per notification at most.
Solution 2
Create a table
TABLE external_id_notification (
notification_id uuid,
external_id text,
PRIMARY KEY (notification_id, external_id)
);
but that would mean making one extra read operation (and of course maintaining another table), which I understand is also bad practice.
The thing to understand about secondary indexes is that their scalability issue is not with the number of rows in the table, but with the number of nodes in your cluster. Because the index is local to each node, a select on an indexed column means that every single node has to process the query and respond to it; no single node can serve the select efficiently on its own.
Use secondary indexes for administrative purposes only (i.e., you, by hand, on cqlsh). Do not use them in production code.
That being said, you could duplicate all the information into your external_id_notification table. That would alleviate the need for an extra read operation. I know that relational databases taught you that duplicate data is bad (what if it differs?) and that you should always normalize. But you are not on a relational database. Denormalization is a thing, and on Cassandra you should always go for it, unless you absolutely cannot.
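As a sketch of what that denormalized table could look like (the column set here is assumed from the notification table above, plus whatever the "..." hides):
CREATE TABLE external_id_notification (
notification_id uuid,
external_id text,
partition_key_date text,
status text,
PRIMARY KEY (notification_id, external_id)
);
Updating the status then means writing to this table as well as the other two, which is the usual price of denormalization in Cassandra.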

Cassandra data modeling - Do I choose hotspots to make the query easier?

Is it ever okay to build a data model that makes the fetch query easier, even though it will likely create hotspots within the cluster?
While reading, please keep in mind I am not working with Solr right now, and given how frequently this data will be accessed I didn't think using spark-sql would be appropriate. I would like to keep this as pure Cassandra.
We have transactions, which are modeled using a UUID as the partition key so that the data is evenly distributed around the cluster. One of our access patterns requires that a UI get all records for a given user and date range, query like so:
select * from transactions_by_user_and_day where user_id = ? and created_date_time > ?;
The first model I built uses the user_id and created_date (day the transaction was created, always set to midnight) as the primary key:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date timestamp,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, created_date), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
This table seems to perform well. Using the created_date as part of the PK allows users to be spread around the cluster more evenly to prevent hotspots. However, from an access perspective it makes the data access layer do a bit more work than we would like. It ends up having to create an IN statement with all days in the provided range instead of supplying a single date and a greater-than operator:
select * from transactions_by_user_and_day where user_id = ? and created_date in (?, ?, …) and created_date_time > ?;
To simplify the work to be done at the data access layer, I have considered modeling the data like so:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
With the above model, the data access layer can fetch the transaction_id's for the user and filter on a specific date range within Cassandra. However, this creates a chance of hotspots within the cluster. Users with longevity and/or high volume will create quite a few more columns in the row. We intend to apply a TTL to the data so anything older than 60 days drops off. Additionally, I've analyzed the size of the data, and 60 days' worth of data for our most high-volume user is under 2 MB. Doing the math, if we assume that all 40,000 users (this number won't grow significantly) are spread evenly over a 3-node cluster at 2 MB of data per user, you end up with a max of just over 26 GB per node ((13333.33*2)/1024). In reality, you aren't going to end up with 1/3 of your users doing that much volume, and you'd have to get really unlucky to have Cassandra, using vnodes, put all of those users on a single node. From a resources perspective, I don't think 26 GB is going to make or break anything either.
Thanks for your thoughts.
Data Model 1: Something else you could do would be to change your data access layer to run a separate query for each day (partition) individually, instead of using the IN clause. Check out this page to understand why that would be better.
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
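In practice that means the data access layer fires one single-partition query per day in the requested range (the date literal below is just an example), and the driver can execute these asynchronously and in parallel:
SELECT * FROM transactions_by_user_and_day
WHERE user_id = ? AND created_date = '2016-06-01' AND created_date_time > ?;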
Data model 2: 26GB of data per node doesn't seem like much, but a 2MB fetch seems a bit large. Of course if this is an outlier, then I don't see a problem with it. You might try setting up a cassandra-stress job to test the model. As long as the majority of your partitions are smaller than 2MB, that should be fine.
One other solution would be to use Data Model 2 with bucketing. This would give you more overhead on writes, though, as you'd have to maintain a bucket lookup table as well. Let me know if you need me to elaborate more on this approach.
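A minimal sketch of what that could look like, assuming a simple integer bucket per user plus a small lookup table for finding a user's current bucket (the names and bucketing scheme are illustrative):
CREATE TABLE transactions_by_user_bucket (
user_id int,
bucket int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, bucket), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
CREATE TABLE user_current_bucket (
user_id int PRIMARY KEY,
current_bucket int
);
Reads look up the current bucket first and step back through older buckets if more rows are needed; writes have to keep user_current_bucket up to date, which is the extra write overhead mentioned above.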

select older versions of data after update in Cassandra

This is my use-case.
I have inserted a row of data in Cassandra with the following query:
INSERT INTO TableWide1 (UID, TimeStampCol, Value, DateCol) VALUES ('id1','2016-03-24 17:54:36',45,'2015-03-24 00:00:00');
I update one row to have a new value.
update TableWide1 set Value = 46 where uid = 'id1' and datecol='2015-03-24 00:00:00' and timestampcol='2016-03-24 17:54:36';
Now, I would like to see all versions of this data from Cassandra. I know in HBase, this is pretty straightforward, but in Cassandra, is this even possible?
I explored a bit using writetime(), but it just gives the latest write time of the newly updated data, and it cannot be used in a WHERE clause either.
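For reference, that exploration presumably looked something like the following, and it only ever returns the write time of the current value:
SELECT Value, writetime(Value) FROM TableWide1
WHERE UID = 'id1' AND DateCol = '2015-03-24 00:00:00' AND TimeStampCol = '2016-03-24 17:54:36';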
This is what my schema looks like:
CREATE TABLE TableWide1(
UID varchar,
TimeStampCol timestamp,
Value double,
DateCol timestamp,
PRIMARY KEY ((UID,DateCol), TimeStampCol)
);
So is this technically possible, given the fact the old data still exists in Cassandra?
If your partitions won't get too wide, you could exclude the date from the partitioning:
CREATE TABLE table_wide (
UID varchar,
TimeStampCol timestamp,
Value double,
PRIMARY KEY ((UID), TimeStampCol)
);
That's generally bad though, since eventually you will hit the limits of a partition.
But really, you had it right. You won't be able to do it in a single statement, but under the covers you can't stream the entire set over anyway; it will have to page through it. So you can just iterate through the results of each day, one at a time. If your dataset has days with no data and you don't want to waste reads, you can keep an additional table around to mark which days have data:
CREATE TABLE table_wide_partition_list (
UID varchar,
DateCol timestamp,
PRIMARY KEY ((UID), DateCol)
);
And make one query to it first.
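Put together, the read path becomes two steps: one query against the partition list, then one query per day against the original table (the date literal is just an example):
SELECT DateCol FROM table_wide_partition_list WHERE UID = 'id1';
SELECT * FROM TableWide1 WHERE UID = 'id1' AND DateCol = '2015-03-24 00:00:00';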
Really, if you want HBase-like behavior for scans, you are probably looking for more of an OLAP-style workload than normal C* usage. For that, it's almost universally recommended to use Spark with Cassandra at the moment.
Cassandra does not retain old data when it is updated.
It simply marks the old value as obsolete and gets rid of it when compaction happens.
HBase was not made for handling real-time applications and hot data from/for an application server, though things have improved since the early days of HBase.
People use HBase mainly because they already have a Hadoop cluster.
Another notable and important difference is that Cassandra is very fast at retrieving single or multiple records by key, but not at range predicates (> and <), because data is distributed based on the hashed key. HBase, on the other hand, stores data in sorted order and is an ideal candidate for range queries.
Anyway, since Cassandra doesn't retain the old data, you cannot retrieve it.

Cassandra - Datamodel

I'm new to Cassandra and I'm not certain how to model my data.
Let's assume the object I want to store looks like this:
C_ID,
VALUE,
TIMESTAMP,
D_TYPE,
E_ID
Currently I'm storing the data in an RDBMS with C_ID and TIMESTAMP as the primary key.
I'm aware that in NoSQL I should model my data around the selects/deletes I have to do.
The select I want to do would always have a C_ID, but can additionally contain TIMESTAMP, D_TYPE and E_ID.
So mandatory: C_ID
and maybe: TIMESTAMP, D_TYPE, E_ID
The delete I want to do goes through time:
DELETE FROM columnfamily WHERE time < date;
And here comes the problem. As far as I've researched, the primary key of a column family has to be in the select and in the delete.
So in my case that would make TIMESTAMP the only possible primary key, since I have it in both the select and the delete.
And that's a big problem, since I get data from multiple machines and I can't guarantee that a given time (even down to the ms) will only occur once; another problem is that I can't really query over a range anymore.
And I can only delete with IN or = syntax, so if I have to delete, for example, one day, I'm looking at a batch statement with 86,400,000 deletes (at ms precision).
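For what it's worth, a minimal sketch of the kind of table the select requirements above seem to describe; the table name, column names/types and literals here are assumptions (event_time stands in for TIMESTAMP), the time-based delete becomes a per-C_ID range delete on the clustering column (only supported from Cassandra 3.0 on), and filtering on D_TYPE/E_ID would still have to be handled separately:
CREATE TABLE data_by_c_id (
c_id text,
event_time timestamp,
value text,
d_type text,
e_id text,
PRIMARY KEY ((c_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
SELECT * FROM data_by_c_id WHERE c_id = 'c1' AND event_time >= '2016-01-01' AND event_time < '2016-01-02';
DELETE FROM data_by_c_id WHERE c_id = 'c1' AND event_time < '2016-01-01';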
So at the moment I'm totally stuck with this problem. =/
Does anyone know anything helpful?
Thanks in advance.
-- UPDATE 1
I'm using C# and the DataStax driver at the moment.
