How does the number of partitions influence repair time in a Cassandra cluster?
Is it correct that the fewer partitions there are, the faster the Merkle tree computation and the repair procedure run?
Will repair be faster for
CREATE TABLE ks.t1 (
id2 bigint,
id1 bigint,
name text,
PRIMARY KEY (id2, id1, name)
);
than for
CREATE TABLE ks.t1 (
id2 bigint,
id1 bigint,
name text,
PRIMARY KEY ((id2, id1), name)
);
if count(id2, id1) > count(id2)?
When triggering repair, Cassandra will:
- read all SSTables locally on disk into memory
- compute the Merkle tree
- exchange the Merkle trees between the different replicas
- if there is a mismatch, send a block of partitions over the network
The Merkle tree resolution only allows 32768 leaf nodes. If there are more than 32768 partitions on a single replica, many partitions will hash into the same leaf node, so if a single partition mismatches, the whole block of partitions for that leaf needs to be sent. That's what I call over-repair.
This issue is solved, more or less, by sub-range repair: instead of repairing the whole token range for a table, Cassandra just attempts to repair a portion of the token range. The direct result is that the Merkle tree resolution is higher, since there are fewer partitions to cover.
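A minimal sketch of a sub-range repair with nodetool, using the -st/-et token bounds (the token values and the ks/t1 names here are arbitrary examples):

nodetool repair -st -9223372036854775808 -et -4611686018427387904 ks t1

Tools like Cassandra Reaper automate splitting the full ring into such sub-ranges.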
So yes, it seems that having fewer partitions will reduce over-repair.
But ....
In your example, fewer partitions == wider partitions, which is not ideal either.
Why? Because if there is a single cell mismatch in a wide partition, Cassandra will need to repair the entire partition, which is a waste of resources.
Furthermore, wide partitions make the read path slower, because the data is likely to span many SSTables.
In conclusion, I would personally prefer PRIMARY KEY ((id2, id1), name) and use sub-range repair.
Related
From my understanding, Apache Cassandra puts each row of a table into a separate partition located on a separate node. In that case, if we consider a table having millions of records or rows, Cassandra would partition the records across millions of nodes.
My question is: what if enough nodes are not available to store each record, in the case of a table with millions of records that is continuously growing?
Your understanding is wrong. The three main keywords used in your question are partition, row and node. Now consider how they are defined:
Node represents the Cassandra process running on a virtual machine/bare metal/cloud.
Partition represents a logical entity that helps the Cassandra cluster know on which node the requested data resides. The primary key should be unique.
Row represents a record contained within a partition. A partition can contain millions of rows.
Based on your partition key, your Cassandra cluster will identify on which node the data will reside. If you have three nodes, Cassandra will take a hash of your partition key, and based on that value the node where the data will be written is identified. So as you scale, the hash ranges will be redistributed (and the partitions along with them).
So even if you have millions of records, they can all reside on a single node if your cluster has one node; if you have multiple nodes, your data will be distributed almost equally among them.
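You can see this hashing for yourself with the token() function in cqlsh. A minimal sketch, reusing the ks.t1 table from the first question (the actual token values depend on the partitioner):

SELECT id2, id1, token(id2, id1) FROM ks.t1 LIMIT 3;

Keys whose tokens fall into the same token range end up on the same replicas.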
What is the best tool to find the number of rows in each Cassandra partition? I have a big partition and I want to know how many records there are in that partition.
nodetool tablehistograms <keyspace> <table> will give you the distribution of cell counts and partition sizes for the table, but it does not tell you about one specific partition. To get the count for a specific partition, you must use count(*) in a SELECT query that specifies the partition key in the WHERE clause. On a very large partition that query can fail, though.
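For example, reusing the ks.t1 table from the first question (the key values here are made up):

SELECT count(*) FROM ks.t1 WHERE id2 = 12345 AND id1 = 678;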
As of 4.0, sstablemetadata is based on the describe command from sstable-tools. If you provide the -s flag to scan the sstable, it will give you the partitions largest in size, the partitions largest in number of rows, and the partitions with the most tombstones. It can be used against 3.0 and 3.11 sstables; I think 2.1 sstables cannot be processed, though.
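A typical invocation looks like the following (the data file path is a placeholder; point it at one of your own sstables), producing output like the listing below:

sstablemetadata -s /var/lib/cassandra/data/<keyspace>/<table>-<id>/<generation>-big-Data.db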
...
Partitions: 22515
Rows: 13579337
Tombstones: 0
Cells: 13579337
Widest Partitions:
[12345] 999999
[99049] 62664
[99007] 60437
[99017] 59728
[99010] 59555
Largest Partitions:
[12345] 189888705
[99049] 2965017
[99007] 2860391
[99017] 2826094
[99010] 2818038
...
The above example has an int partition key; with a text partition key it will print the key out like:
Widest Partitions:
[frodo] 1
Largest Partitions:
[frodo] 104
You can find the total number of partitions available for a table with the nodetool command ./nodetool cfstats <keyspace>.<table>.
If you know the partition key, you can fire a SELECT count(*) for the partition to get the number of records in that partition. Count queries on big partitions can time out, so set the cqlsh request timeout before executing the query.
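For example, cqlsh accepts a request timeout in seconds on the command line (120 here is an arbitrary choice), after which the count query can be run as usual:

cqlsh --request-timeout=120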
To understand how to calculate the physical partition size, go through the DataStax DS220: Data Modeling material on partition size.
Instaclustr has a tool to find the partition size. However, this does not show the number of records in each partition:
https://github.com/instaclustr/cassandra-sstable-tools
As mentioned above, you can also use the built-in nodetool, which can be found in the bin folder of the extracted Cassandra distribution, and run it in a terminal:
nodetool toppartitions
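A sketch of the pre-4.0 invocation, assuming a keyspace ks, a table t1 and a sampling window of 10000 milliseconds:

nodetool toppartitions ks t1 10000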
Additionally, you can use an online tool such as https://www.cqlguru.io/, but this needs some prior information, such as the average number of rows per partition and the average length of text in varchar columns. This tool is good for a rough estimation, though.
I understand the difference between a Cassandra partition key, composite key and clustering key, but I can't find enough information to understand how partitions are handled in Cassandra.
In Cassandra, is a range of partition keys stored on a node like a partition/shard? Is my understanding correct or not?
Does each partition key have a different file (at the system level) in the DB? If so, won't the reads be slower?
If each partition key does not have a different file in the DB, how is it handled?
Data is stored in Cassandra in wide rows called partitions. Each row has a partition key used for identifying that partition. For distributing the data across the cluster, Cassandra uses partitioners, which basically compute hashes of the partition key; the data is distributed across the cluster based on these values. The default partitioner in Cassandra is Murmur3Partitioner.
At the OS level, the data is stored in sstable files. A partition can be spread across many sstables. That's why you also need compaction, which is the process of consolidating those sstables so your partitions won't be spread across a lot of sstables. Reducing the number of sstables a partition is spread across will also improve read time. It's worth noting that sstables are immutable.
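If you want to see how many sstables a given partition is currently spread across, nodetool can report that. A sketch, assuming a keyspace ks, a table t1 and a partition key value of 12345 (all made up):

nodetool getsstables ks t1 12345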
I suggest reading the documentation, especially the section "How Cassandra reads and writes data".
I'm working on designing a Cassandra column family.
I ran into a situation of higher GC while SELECTing after loading a higher density of data, that is, after the amount of data in a partition increased. For low-density data it works fine.
I want to know how Cassandra executes a SELECT query (with both partition and clustering key specified).
Is the whole set of data in a partition loaded into memory when we execute a SELECT?
Will a large number of partition keys affect performance?
Cassandra does not load the entire partition into memory, but it does load IndexInfo objects, which help Cassandra find the relevant CQL rows within the partition. These are short-lived Java objects that can create quite a bit of heap pressure (GC pauses); this is a design issue that will be addressed in CASSANDRA-9754 (known as Birch, a B-tree implementation of the index data structure).
Until cassandra-4.0 is released, you should target 100 MB for your max partition size and break larger partitions into smaller pieces.
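A common way to break a large partition into smaller pieces is to add a bucket column to the partition key. A minimal sketch with a made-up table (the per-day bucketing scheme is just for illustration):

CREATE TABLE ks.readings_by_day (
    sensor_id bigint,
    day date,          -- bucket: one partition per sensor per day
    ts timestamp,
    value double,
    PRIMARY KEY ((sensor_id, day), ts)
);

Each (sensor_id, day) pair now forms its own partition, so no single partition grows without bound.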
We modeled our data in a Cassandra table with a partition key, let's say "pk". We have a total of 100 unique values for pk, and our cluster size is 160. We are using the random partitioner. When we add data to Cassandra (with a replication factor of 3) for all 100 partitions, I noticed that those 100 partitions are not distributed evenly: one node has as many as 7 partitions, and a lot of nodes have only one partition or none. Given that we are using the random partitioner, I expected the distribution to be reasonably even. Because 7 partitions are on the same node, that's creating a hot spot for us. Is there a better way to distribute partitions evenly?
Any input is appreciated.
Thanks
I suspect the problem is the low cardinality of your partition key. With only 100 possible values, it's not unexpected that several values end up hashing to the same nodes.
If you have 160 nodes, then having only 100 possible values for your partition key means you aren't using all 160 nodes effectively. An even distribution of data comes from inserting a lot of data with a high-cardinality partition key.
So I'd suggest you figure out a way to increase the cardinality of your partition key. One way to do this is to use a compound partition key, by including some part of your clustering columns or data fields in your partition key.
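For example (a sketch with made-up columns), promoting a data field into a compound partition key multiplies the number of possible partitions:

-- before: at most 100 partitions, one per pk value
CREATE TABLE ks.metrics (
    pk int,
    sensor text,
    ts timestamp,
    value double,
    PRIMARY KEY (pk, sensor, ts)
);

-- after: up to 100 * (number of distinct sensors) partitions
CREATE TABLE ks.metrics_v2 (
    pk int,
    sensor text,
    ts timestamp,
    value double,
    PRIMARY KEY ((pk, sensor), ts)
);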
You might also consider switching to the Murmur3Partitioner, which generally gives better performance and is the current default partitioner on the newest releases. But you'd still need to address the low cardinality problem.
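For reference, the partitioner is configured in cassandra.yaml; note that it cannot be changed on a cluster that already holds data without reloading that data:

partitioner: org.apache.cassandra.dht.Murmur3Partitioner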