Cassandra data modeling - Do I choose hotspots to make the query easier? - cassandra

Is it ever okay to build a data model that makes the fetch query easier even though it will likely created hotspots within the cluster?
While reading, please keep in mind I am not working with Solr right now and given the frequency this data will be accessed I didn’t think using spark-sql would be appropriate. I would like to keep this as pure Cassandra.
We have transactions, which are modeled using a UUID as the partition key so that the data is evenly distributed around the cluster. One of our access patterns requires that a UI get all records for a given user and date range, query like so:
select * from transactions_by_user_and_day where user_id = ? and created_date_time > ?;
The first model I built uses the user_id and created_date (day the transaction was created, always set to midnight) as the primary key:
CREATE transactions_by_user_and_day (
user_ id int,
created_date timestamp,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, created_date), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
This table seems to perform well. Using the created_date as part of the PK allows users to be spread around the cluster more evenly to prevent hotspots. However, from an access perspective it makes the data access layer do a bit more work that we would like. It ends up having to create an IN statement with all days in the provided range instead of giving a date and greater than operator:
select * from transactions_by_user_and_day where user_id = ? and created_date in (?, ?, …) and created_date_time > ?;
To simplify the work to be done at the data access layer, I have considered modeling the data like so:
CREATE transactions_by_user_and_day (
user_id int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_global_id), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
With the above model, the data access layer can fetch the transaction_id’s for the user and filter on a specific date range within Cassandra. However, this causes a chance of hotspots within the cluster. Users with longevity and/or high volume will create quite a few more columns in the row. We intend on supplying a TTL on the data so anything older than 60 days drops off. Additionally, I’ve analyzed the size of the data and 60 days’ worth of data for our most high volume user is under 2 MB. Doing the math, if we assume that all 40,000 users (this number wont grow significantly) are spread evenly over a 3 node cluster and 2 MB of data per user you end up with a max of just over 26 GB per node ((13333.33*2)/1024). In reality, you aren’t going to end up with 1/3 of your users doing that much volume and you’d have to get really unlucky to have Cassandra, using V-Nodes, put all of those users on a single node. From a resources perspective, I don’t think 26 GB is going to make or break anything either.
Thanks for your thoughts.

Date Model 1:Something else you could do would be to change your data access layer to do a query for each ID individually, instead of using the IN clause. Check out this page to understand why that would be better.
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Data model 2: 26GB of data per node doesn't seem like much, but a 2MB fetch seems a bit large. Of course if this is an outlier, then I don't see a problem with it. You might try setting up a cassandra-stress job to test the model. As long as the majority of your partitions are smaller than 2MB, that should be fine.
One other solution would be to use Data Model 2 with Bucketing. This would give you more overhead on writes as you'd have to maintain a bucket lookup table as well though. Let me know if need me to elaborate more on this approach.

Related

Cassandra read perfomance slowly decreases over time

We have a Cassandra cluster that consists of six nodes with 4 CPUs and 16 Gb RAM each and underlying shared storage (SSD). I'm aware that shared storage considered a bad practice for Cassandra, but ours is limited at the level of 3 Gb/s on reads and seems to be reliable against exigent disk requirements.
The Cassandra used as an operational database for continuous stream processing.
Initially Cassandra serves requests at ~1,700 rps and it looks nice:
The initial proxyhistograms:
But after a few minutes the perfomance starts to decrease and becomes more than three times worse in the next two hours.
At the same time we observe that the IOWait time increases:
And proxyhistograms shows the following picture:
We can't understand the reasons that lie behind such behaviour. Any assistance is appreciated.
EDITED:
Table definitions:
CREATE TABLE IF NOT EXISTS subject.record(
subject_id UUID,
package_id text,
type text,
status text,
ch text,
creation_ts timestamp,
PRIMARY KEY((subject_id, status), creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.c_record(
c_id UUID,
s_id UUID,
creation_ts timestamp,
ch text,
PRIMARY KEY(c_id, creation_ts, s_id)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.s_by_a(
s int,
number text,
hold_number int,
hold_type text,
s_id UUID,
PRIMARY KEY(
(s, number),
hold_type,
hold_number,
s_id
)
);
far from 100 Mb
While some opinions may vary on this, keeping your partitions in the 1MB to 2MB range is optimal. Cassandra typically doesn't perform well when returning large result set. Keeping the partition size small, helps queries perform better.
Without knowing what queries are being run, I can say that with queries which deteriorate over time... time is usually the problem. Take this PRIMARY KEY definition, for example:
PRIMARY KEY((subject_id, status), creation_ts)
This is telling Cassandra to store the data in a partition (hashed from a concatenation of subject_id and status), then to sort and enforce uniqueness by creation_ts. What can happen here, is that there doesn't appear to be an inherent way to limit the size of the partition. As the clustering key is a timestamp, each new entry (to a particular partition) will cause it to get larger and larger over time.
Also, status by definition is temporary and subject to change. For that to happen, partitions would have to be deleted and recreated with every status update. When modeling systems like this, I usually recommend status columns as non-key columns with a secondary index. While secondary indexes in Cassandra aren't a great solution either, it can work if the result set isn't too large.
With cases like this, taking a "bucketing" approach can help. Essentially, pick a time component to partition by, thus ensuring that partitions cannot grow infinitely.
PRIMARY KEY((subject_id, month_bucket), creation_ts)
In this case, the application writes a timestamp (creation_ts) and the current month (month_bucket). This helps ensure that you're never putting more than a single month's worth of data in a single partition.
Now this is just an example. A whole month might be too much, in your case. It may need to be smaller, depending on your requirements. It's not uncommon for time-driven data to be partitioned by week, day, or even hour, depending on the required granularity.

How to search record using ORDER_BY without the partition keys

I'm debugging an issue and the logs should be sitting on a time range between 4/23/19~ 4/25/19
There are hundreds of millions of records on our production.
It's impossible to locate the target records using random sort.
Is there any workaround to search in a time range without partition key?
select * from XXXX.report_summary order by modified_at desc
Schema
...
"modified_at" "TimestampType" "regular"
"record_end_date" "TimestampType" "regular"
"record_entity_type" "UTF8Type" "clustering_key"
"record_frequency" "UTF8Type" "regular"
"record_id" "UUIDType" "partition_key"
First, ORDER BY is really quite superfluous in Cassandra. It can only operate on your clustering columns within a partition, and then only on the exact order of the clustering columns. The reason for this, is that Cassandra reads sequentially from the disk, so it writes all data according to the defined clustering order to begin with.
So IMO, ORDER BY in Cassandra is pretty useless, except for cases where you want to change the sort direction (ascending/descending).
Secondly, due to its distributed nature, you need to take a query-oriented approach to data modeling. In other words, your tables must be designed to support the queries you intend to run. Now you can find ways around this, but then you're basically doing a full table scan on a distributed cluster, which won't end well for anyone.
Therefore, the recommended way to go about that, would be to build a table like this:
CREATE TABLE stackoverflow.report_summary_by_month (
record_id uuid,
record_entity_type text,
modified_at timestamp,
month_bucket bigint,
record_end_date timestamp,
record_frequency text,
PRIMARY KEY (month_bucket, modified_at, record_id)
) WITH CLUSTERING ORDER BY (modified_at DESC, record_id ASC);
Then, this query will work:
SELECT * FROM report_summary_by_month
WHERE month_bucket = 201904
AND modified_at >= '2019-04-23' AND modified_at < '2019-04-26';
The idea here, is that as you care about the order of the results, you need to partition by something else to allow for sorting to work. For this example, I picked month, hence I've "bucketed" your results by month into a partition key called month_bucket. Within each month, I'm clustering on modified_at in DESCending order. This way, the most-recent results are at the "top" of the partition. Then, I threw in record_id as a tie-breaker key to help ensure uniqueness.
If you're still focused on doing this the wrong way:
You can actually run a range query on your current schema. But with "hundreds of millions of records" across several nodes, I don't have high hopes for that to work. But you can do it with the ALLOW FILTERING directive (which you shouldn't ever really use).
SELECT * FROM report_summary
WHERE modified_at >= '2019-04-23'
AND modified_at < '2019-04-26' ALLOW FILTERING;
This approach has the following caveats:
With many records across many nodes, it will likely time out.
Without being able to identify a single partition for this query, a coordinator node will be chosen, and that node has a high chance of becoming overloaded.
As this is pulling rows from multiple partitions, a sort order cannot be enforced.
ALLOW FILTERING makes Cassandra work in ways that it really wasn't designed to, so I would never use that on a production system.
If you really need to run a query like this, I recommend using an in-memory aggregation tool, like Spark.
Also, as the original question was about ORDER BY, I wrote an article a while back which better explains this topic: https://www.datastax.com/dev/blog/we-shall-have-order

Regarding Cassandra's (sloppy, still confusing) documentation on keys, partitions

I have a high-write table I'm moving from Oracle to Cassandra. In Oracle the PK is a (int: clientId, id: UUID). There are about 10 billion rows. Right off the bat I run into this nonsensical warning:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html :
"If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index."
Not only does this seem to defeat efficient find by PK it fails to define what it means to "query between the fields" and what the difference is between a built-in index, a secondary-index, and the primary_key+clustering subphrases in a create table command. A junk description. This is 2019. Shouldn't this be fixed by now?
AFAIK it's misleading anyway:
CREATE TABLE dev.record (
clientid int,
id uuid,
version int,
payload text,
PRIMARY KEY (clientid, id, version)
) WITH CLUSTERING ORDER BY (id ASC, version DESC)
insert into record (id,version,clientid,payload) values
(d5ca94dd-1001-4c51-9854-554256a5b9f9,3,1001,'');
insert into record (id,version,clientid,payload) values
(d5ca94dd-1002-4c51-9854-554256a5b9e5,0,1002,'');
The token on clientid indeed shows they're in different partitions as expected.
Turning to the big point. If one was looking for a single row given the clientId, and UUID ---AND--- Cassandra allowed you to skip specifying the clientId so it wouldn't know which node(s) to search, then sure that find could be slow. But it doesn't:
select * from record where id=
d5ca94dd-1002-4c51-9854-554256a5b9e5;
InvalidRequest: ... despite the performance unpredictability,
use ALLOW FILTERING"
And ditto with other variations that exclude clientid. So shouldn't we conclude Cassandra handles high cardinality tables searches that return "very few results" just fine?
Anything that requires reading the entire context of the database wont work which is the case with scanning on id since any of your clientid partition key's may contain one. Walking through potentially thousands of sstables per host and walking through each partition of each of those to check will not work. If having hard time with data model and not totally getting difference between partition keys and clustering keys I would recommend you walk through some introduction classes (ie datastax academy), youtube videos or book etc before designing your schema. This is not a relational database and designing around your data instead of your queries will get you into trouble. When moving from oracle you should not just copy your tables over and move the data or it will not work as well.
The clustering key is the order in which the data for a partition is ordered on disk which is what it is referring to as "build-in index". Each sstable has an index component that contains the partition key locations for that sstable. This also includes an index of the clustering keys for each partition every 64kb (by default at least) that can be searched on. The clustering keys that exist between each of these indexed points are unknown so they all have to be checked. A long time ago there was a bloom filter of clustering keys kept as well but it was such a rare use case where it helped vs the overhead that it was removed in 2.0.
Secondary indexes are difficult to scale well which is where the warning comes from about cardinality, I would strongly recommend just denormalizing data and not using index in any form as using large scatter gather queries across a distributed system is going to have availability and performance issues. If you really need it check out http://www.doanduyhai.com/blog/?p=13191 to try to get the data right (not worth it in my opinion).

Cassandra how many columns/row for optimal performance?

I am writing a chat server and, want to store my messages in cassandra. Because I need range queries and I know that I will expect 100 messages/day and maintain history for 6 months I will have 18000 messages for a user at a point.
Now, since I'll do range queries I need my data to be on the same machine. Either I have to use ByteOrderPartitioner, which I don't understand fully, or I can store all the message for a user on the same row.
create table users_conversations(jid1 bigint, jid2 bigint, archiveid timeuuid, stanza text, primary key((jid1, jid2), archiveid)) with CLUSTERING ORDER BY (archiveid DESC );
So I'll have 18000 columns. Do you think I'll have performance problems using this cluster key approach?
If yes, what alternative do I have?
Thanks
Do not use the ByteOrderedPartitioner. I cannot stress enough how important that point is.
since I'll do range queries I need my data to be on the same machine.
With your PRIMARY KEY defined like this:
primary key((jid1, jid2), archiveid)
Your current partitioning keys (jid1 and jid2) will be combined and hashed so that all messages for specific values of jid1 and jid2 are stored together on the same partition. The drawback is that you will need both jid1 and jid2 for each query. But they will be sorted on archiveid, you will be able to query by range on archiveid, and it should perform well as long as you don't hit the 2 billion columns per partition limit.

Cassandra data model for application logs (billions of operations!)

Say, I want to collect logs from a huge application cluster which produces 1000-5000 records per second. In future this number might reach 100000 records per second, aggregated from a 10000-strong datacenter.
CREATE TABLE operation_log (
-- Seconds will be used as row keys, thus each row will
-- contain 1000-5000 log messages.
time_s bigint,
time_ms int, -- Microseconds (to sort data within one row).
uuid uuid, -- Monotonous UUID (NOT time-based UUID1)
host text,
username text,
accountno bigint,
remoteaddr inet,
op_type text,
-- For future filters — renaming a column must be faster
-- than adding a column?
reserved1 text,
reserved2 text,
reserved3 text,
reserved4 text,
reserved5 text,
-- 16*n bytes of UUIDs of connected messages, usually 0,
-- sometimes up to 100.
submessages blob,
request text,
PRIMARY KEY ((time_s), time_ms, uuid)) -- Partition on time_s
-- Because queries will be "from current time into the past"
WITH CLUSTERING ORDER BY (time_ms DESC)
CREATE INDEX oplog_remoteaddr ON operation_log (remoteaddr);
...
(secondary indices on host, username, accountno, op_type);
...
CREATE TABLE uuid_lookup (
uuid uuid,
time_s bigint,
time_ms int,
PRIMARY KEY (uuid));
I want to use OrderedPartitioner which will spread data all over the cluster by its time_s (seconds). It must also scale to dozens of concurrent data writers as more application log aggregators are added to the application cluster (uniqueness and consistency is guaranteed by the uuid part of the PK).
Analysts will have to look at this data by performing these sorts of queries:
range query over time_s, filtering on any of the data fields (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
pagination query from the results of the previous one (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND token(uuid) < token($uuid) AND $filters),
count messages filtered by any data fields within a time range (SELECT COUNT(*) FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
group all data by any of the data fields within some range (will be performed by application code),
request dozens or hundreds of log messages by their uuid (hundreds of SELECT * FROM uuid_lookup WHERE uuid IN [00000005-3ecd-0c92-fae3-1f48, ...]).
My questions are:
Is this a sane data model?
Is using OrderedPartitioner the way to go here?
Does provisioning a few columns for potential filter make sense? Or is adding a column every once in a while cheap enough to run on a Cassandra cluster with some reserved headroom?
Is there anything that prevents it from scaling to 100000 inserted rows per second from hundreds of aggregators and storing a petabyte or two of queryable data, provided that the number of concurrent queryists will never exceed 10?
This data model is close to a sane model, with several important modifications/caveats:
Do not use ByteOrderedPartitioner, especially not with time as the key. Doing this will result in severe hotspots on your cluster, as you'll do most of your reads and all your writes to only part of the data range (and therefore a small subset of your cluster). Use Murmur3Partitioner.
To enable your range queries, you'll need a sentinel key--a key you can know in advance. For log data, this is probably a time bucket + some other known value that's not time-based (so your writes are evenly distributed).
Your indices might be ok, but it's hard to tell without knowing your data. Make sure your values are low in cardinality, or the index won't scale well.
Make sure any potential filter columns adhere to the low cardinality rule. Better yet, if you don't need real-time queries, use Spark to do your analysis. You should create new columns as needed, as this is not a big deal. Cassandra stores them sparsely. Better yet, if you use Spark, you can store these values in a map.
If you follow these guidelines, you can scale as big as you want. If not, you will have very poor performance and will likely get performance equivalent to a single node.

Resources