Does having a lot of unique, small partitions in a table affect performance or create extra load in Cassandra?

I have a table with 4 million unique partition keys
select count(*) from "KS".table;

 count
---------
 4355748

(1 rows)
I have read that the cardinality of the partition key should be neither too high nor too low, which means the partition key should not be too unique. Is that correct?
The table does not have any clustering key. Will changing the data partitioning help with the load?

It really depends on the use case. If you don't have a natural clustering by partition, then there may be little sense in introducing it. Also, what are the read patterns? Do you need to read multiple rows in one go, or not?
The number of partitions has an effect on the size of the bloom filter, key cache, etc., so as you increase the number of partitions, the bloom filter grows and the key cache gets fewer hits (until you increase its size).

As far as I know, Cassandra uses consistent hashing to map each partition key to a position on the token ring, so cardinality by itself should not matter.
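For illustration, you can inspect the Murmur3 token that Cassandra derives from each partition key with the token() function. A minimal sketch, with hypothetical keyspace, table, and column names:

CREATE TABLE IF NOT EXISTS ks.events (
    id uuid PRIMARY KEY,
    payload text
);

-- each id maps to a Murmur3 token; tokens spread uniformly over the
-- ring regardless of how many unique ids exist
SELECT id, token(id) FROM ks.events LIMIT 5;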

Related

Cassandra partition key grow limit?

What does "partitions may grow large" mean? I thought Cassandra could handle very large partitions. Why do they use two partition key columns in this example?
And what should I do if both partition keys still produce partitions that are too large?
The example you gave is one of the ways of preventing partitions from becoming too large. In Cassandra, the partition key (part of the primary key) is used for grouping a similar set of rows.
In the left-hand data model, user_id is the partition key, which means every video interaction by that user will be placed in the same partition. As the example's comment mentions, if a user is active and has 1,000 interactions daily, then in 60 days (2 months) you will have 60,000 rows for that user. This may breach Cassandra's permissible partition size (in terms of data size stored in a single partition).
To avoid this situation, there are several ways you can keep a partition from growing too big (a CQL sketch of both follows this list). For example, you can:
Make another column from that table part of the partition key. This is what the example above does: video_id is made part of the partition key along with user_id.
Bucketing - this strategy is generally used with time-series data, where you split one partition key into multiple buckets. For example, if date is your partition key, you can create 24 buckets as date_1, date_2, ..., date_24. You have now divided your partition key into smaller partition keys, and hence divided one big partition into 24 small partitions.
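A minimal CQL sketch of both techniques, assuming hypothetical table and column names:

-- Technique 1: make another column part of the partition key
CREATE TABLE IF NOT EXISTS killrvideo.interactions_by_user_video (
    user_id uuid,
    video_id uuid,
    interaction_time timestamp,
    interaction_type text,
    PRIMARY KEY ((user_id, video_id), interaction_time)
) WITH CLUSTERING ORDER BY (interaction_time DESC);

-- Technique 2: bucketing - an hour bucket splits one date into 24 partitions
CREATE TABLE IF NOT EXISTS metrics.events_by_date_bucket (
    event_date date,
    bucket smallint,           -- 0..23, e.g. the hour of the event
    event_time timestamp,
    payload text,
    PRIMARY KEY ((event_date, bucket), event_time)
);

With the bucketed table, the application computes the bucket on write (e.g. from the event's hour) and, on read, queries only the buckets it needs.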
The main idea is to keep your partitions from growing too big in size. This is a data modeling technique that anyone creating a data model for Cassandra should be aware of.
If you still have large partitions, you need to remodel your data using the various data modeling techniques available. For that I would recommend: understand your data, estimate its rate of growth, calculate the estimated partition size, and if your data model does not meet the partition size requirement, refine it.
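As a back-of-envelope illustration (all numbers assumed): 1,000 interactions per day × 60 days = 60,000 rows in one partition; at roughly 1 KB per row that is ~60 MB, already near the commonly cited ~100 MB comfort limit for a single partition.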

Wide partition pattern in Cassandra

What is the 'wide partition pattern' in Cassandra? In the book 'Cassandra: The Definitive Guide' it seems to be a recommended thing, but in some online articles I see it is something to be avoided.
So what actually is it, and is it preferable or not?
A partition in Cassandra represents a grouping of similar rows. In Cassandra it is recommended to model your data so that similar rows fall in the same partition. This is called the wide partition pattern.
Lookups by partition key in Cassandra are very fast, so the wide partition pattern is recommended. But with this recommendation comes a warning: your wide partitions should not become too large.
The reason for the warning (avoiding large partitions) is that searching within a partition gets slower as the partition grows. Large partitions also put a lot of pressure on the heap.
For a deeper understanding, I would recommend reading this blog: https://thelastpickle.com/blog/2019/01/11/wide-partitions-cassandra-3-11.html
A wide partition is a partition that contains many cells/values (number_of_cells = columns × rows).
The commonly used time-series design pattern is a great example of wide-partition table design: you use a date bucket as the partition key, and each bucket contains all the timestamps within the date range that the bucket defines.
This model illustrates why it's called a "wide" partition.
The same idea appears in DynamoDB examples (the sort key in DynamoDB is the equivalent of the clustering key in Cassandra).
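A hedged sketch of such a time-series table (all names hypothetical): each (sensor_id, day) partition grows "wide" as readings for that day accumulate, while the day bucket caps how wide it can get:

CREATE TABLE IF NOT EXISTS metrics.readings_by_sensor_day (
    sensor_id uuid,
    day date,                  -- the date bucket, part of the partition key
    reading_time timestamp,    -- clustering key; one row of cells per reading
    value double,
    PRIMARY KEY ((sensor_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);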

How can I reduce the partition count for a large amount of data in Cassandra, and is it necessary at all?

I have estimated ~500 million rows of data with 5 million unique numbers. My query must fetch data by number and event_date. With number as the partition key, there will be 5 million partitions. I think having so many small partitions is not good, and timeouts occur during queries. I'm having trouble defining the partition key. I have found some synthetic sharding strategies, but couldn't apply them to my model. I could define the partition key as number mod N, but then the rows aren't distributed evenly among the partitions.
How can I model this to reduce the partition count, and is it necessary at all? Is there any limit on partition count?
CREATE TABLE events_by_number_and_date (
    number bigint,
    event_date int,    /* e.g. 20200520 */
    event text,
    col1 int,
    col2 decimal,
    PRIMARY KEY (number, event_date)
);
For your query, a change of the data model won't help, as you're issuing a query that is unsuitable for Cassandra. Although Cassandra supports aggregations such as max, count, avg, and sum, they are designed to work inside a single partition, not across the whole cluster. If you issue them without a restriction on the partition key, the coordinating node needs to reach every node in the cluster, and those nodes will have to go through all of the data they hold.
You can still run this kind of query, but it's better to use something like Spark for it, as Spark is heavily optimized for parallel data processing, and the Spark Cassandra Connector is able to query the data correctly. If you can't use Spark, you can implement your own full token range scan, using code similar to this. But in any case, don't expect a "real-time" answer (< 1 sec).
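To illustrate the token range scan idea (a sketch, not the linked code): each sub-query restricts token(number) to one slice of the ring, and your loop or driver walks consecutive slices until the full Murmur3 range is covered. The bounds below are placeholders:

-- one slice of a full token range scan; repeat with consecutive
-- (start, end] bounds until the whole ring has been covered
SELECT number, event_date, event
FROM events_by_number_and_date
WHERE token(number) > -9223372036854775808
  AND token(number) <= -9200000000000000000;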

What is the time complexity(Big O) of Cassandra operations?

Assume there is only one node with R rows. What is the theoretical time complexity of basic Cassandra operations?
More specifically, I want to know:
key = item. I assume this to be O(log(R)); is that right?
key > item, i.e. a slice. Will C* fetch all R rows to judge whether the condition is met, which results in O(R)? What about ordered rows?
key > 10 AND key < 12. Will C* first select everything matching key > 10 and then filter by key < 12? Or will C* combine them into a single condition for the query?
You don't clarify whether you mean reads or writes, although it seems you are talking about read operations. The read path in Cassandra is highly optimized with different read caches, bloom filters, and different compaction strategies (STCS, LCS, TWCS) for how the data is structured on disk. Data is written on disk in one or more SSTables, and the presence of tombstones degrades read performance, sometimes significantly.
The Cassandra architecture is designed to provide linear scalability as data volumes grow. The premise of having just a single node would be the major limiting factor in read latency as the number of rows R becomes large.

Apache Cassandra Several Partition Keys or Single Computed Key?

I am fairly new to Apache Cassandra, and one thing I am having a hard time understanding is whether I should have a table with several partition key columns or a single computed key (computed in the application layer).
In my specific case I have 16 partition key columns, k1...k16, that together make a single data element unique. With several partition key columns I need to provide them all in my SELECT statement, and I am okay with this, but are there any pros/cons of doing so in terms of storage and/or performance?
The way I understand it, the storage might be larger, but the partition keys are "human readable" and potentially queryable by other clients of this data. I assume that Cassandra computes a hash over my partition key whether it's a single value or several.
My question is: are there storage/performance issues, or any other considerations I should think about, with having several partition key columns versus a single application-computed partition key?
You are correct: Cassandra converts a multi-part partition key into a single hash. So I think any efficiency gains from computing the hash in your application would be minimal at best.
Also, just in case you don't know this, keep in mind that the primary key is divided into the partition key and the clustering keys.
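To see that the whole composite key becomes one hash, you can ask Cassandra for its token directly. A hedged sketch using three of the sixteen columns, with hypothetical names:

CREATE TABLE IF NOT EXISTS ks.elements_by_keys (
    k1 text,
    k2 text,
    k3 text,
    payload blob,
    PRIMARY KEY ((k1, k2, k3))    -- composite partition key, no clustering keys
);

-- token() over all partition key columns returns the single Murmur3 hash
SELECT token(k1, k2, k3)
FROM ks.elements_by_keys
WHERE k1 = 'a' AND k2 = 'b' AND k3 = 'c';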
Cheers, Ben
