Regarding Cassandra Table Size - cassandra

How to calculate the total size of a keyspace in cassandra?
I have tried the nodetool cfstats and nodetool tablestats command. It is giving a lot of information, but I am not sure which field provides the exact information.
Can anybody suggest any method to find out the size of a keyspace and a table in Cassandra?

"nodetool tablestats" replaces the older command "nodetool cfstats". In other words both are the same. Output of this command lists the size of each of the tables within a keyspace.
Amongst the output, you are looking for "Space used (total)" value. Its the Total number of bytes of disk space used by SSTables belonging to this table, including obsolete SSTables waiting to be GCd.
Since there could be multiple tables within a keyspace, you need to sum up "Space used (total)" for all tables belonging to a keyspace to get size occupied by keyspace.
Another alternative if you have SSH access to the nodes, is to get to Cassandra Data directory and issue "du -h" to get the size of each keyspace directory. Again sum up the directory size on all nodes for that keyspace (ignoring the snapshot sizes).

Related

nodetool tablehistograms command returned "No SSTables exists, unable to calculate 'Partition Size' and 'Cell Count' percentiles"

command I ran was
nodetool tablehistograms <keyspace> <table>
The bug was
No SSTables exists, unable to calculate 'Partition Size' and 'Cell Count' percentiles
I tried to calculate partition size for better selections on partition keys, but nodetool command did not work fine as the partition size is not provided with this error
SSTables are immutable as far as concerned, and I do not know if I should (and how to) create SSTables based on existed ones?
Experts, please come solve this problem, really appreciate it.
Best
How exact do you need to be when measuring the partition sizes?
For a quick estimate, 'nodetool tablestats <keyspace.table>' will give you the min, max and avg partition size.
If a more accurate measurement is needed, you could download and use DSBulk and run the count option to pull the largest n partitions for a table, which will also print the key, for example:
dsbulk count --stats.modes partitions --stats.numPartitions <n> -k myKeyspace -t myTable
There are no histograms available for the command to report if there are no SSTables on disk.
The nodetool tablehistograms command collects the metrics from the SSTables but if there are none stored on disk then there is nothing for the command to report.
Make sure that the table contains data in the data/ directory then try again. Cheers!
👉 Please support the Apache Cassandra community by hovering over the cassandra tag above and click on Watch tag. 🙏 Thanks!

How to purge massive data from cassandra when node size is limited

I have a Cassandra cluster (with cassandra v3.11.11) with 3 data centers with replication factor 3. Each node has 800GB nvme but one of the data table is taking up 600GB of data. This results in below from nodetool status:
DataCenter 1:
node 1: 620GB (used) / 800GB (total)
DataCenter 2:
node 1: 610GB (used) / 800GB (total)
DataCenter 3:
node 1: 680GB (used) / 800GB (total)
I cannot install another disk to the server and the server does not have internet connection at all for security reasons. The only thing I can do is to put some small files like scripts into it.
ALL cassandra tables are set up with SizeTieredCompactionStrategy so I end up with very big files ~one single 200GB sstable (full table size 600). Since the cluster is running out of space, I need to free the data which will introduce tombstones. But I have only 120GB of free space left for compaction or garbage collection. That means I can only retain at most 120GB data, and to be safe for the Cassandra process, maybe even less.
I am executed nodetool cleanup so I am not sure if there is more room to free.
Is there any way to free 200GB of data from Cassandra? or I can only retain less than 120 GB data?
(if we remove 200GB data, we have 400GB data left, compaction/gc will need 400+680GB data which is bigger than disk space..)
Thank you!
I personally would start with checking if the whole space is occupied by actual data, and not by snapshots - use nodetool listsnapshots to list them, and nodetool clearsnapshot to remove them. Because if you did snapshot for some reason, then after compaction they are occupying the space as original files were removed.
The next step would be to try to cleanup tombstones & deteled data from the small tables using nodetool garbagecollect, or nodetool compact with -s option to split table into files of different size. For big table I would try to use nodetool compact with --user-defined option on the individual files (assuming that there will be enough space for them). As soon as you free > 200Gb, you can sstablesplit (node should be down!) to split big SSTable into small files (~1-2Gb), so when node starts again, the data will be compacted

why full replication Cassandra cluster have node data size difference

I have a 3-node cassandra cluster (version 3.11.11) with replication factor 3. only 2 of the nodes are receiving requests, and Node3 only sync with the other 2 nodes.
In theory, each node should have the same data size. But in practice, I end up with nodes with different data sizes as shown in the picture.
we have daily nodetool repair, operations like compaction are done automatically with default settings.
What can be the reason for the size difference?
It finally ends up how data gets compacted in the long run. Since compaction is local process and how sstables can be stacked up cannot be guaranteed. So I dont see any abbreviation here. Theory just say all nodes will have same data logically but physically it may vary. For example in node3 you may have old sstables that are not getting compacted due to size (if using STCS) and in other nodes they have compacted and reduced the size of those nodes.

Two Cassandra nodes with a replication factor of 2 yet different sizes in storage used

We have OpenNMS sending graph data to our Cassandra/Newts cluster which is comprised of 2 Cassandra nodes. I've set the replication factor to 2 for the keyspace "newts".
I started the nodes at the same time and left them up for some time, i then ran "nodetool cfstats newts" on each node and both nodes have the exact same write count.
If i however go in to the data directory "/db/newts" of each node and run "du -h" i can see the following:
Node1 storage used: 36K
Node2 storage used: 12M
How can they differ in size if i set the replication factor to 2? I know that they're connected to the same cluster via "nodetool status" which is showing both nodes as "UN" (Up/Normal).
In Cassandra data is not written directly to the hard drive, it lives in:
Commit log >> Memtable >> SSTables
Here you can find a good documentation on how data is written.
You can run:
nodetool flush
which will flush the memtables into sstables. After that you should be able to see the same sstable size on both of your nodes.

Size of cassandra partitions

What is the best tool to find the number of rows in each Cassandra partition? I have a big partition and I want to know how much records are there in that partition
nodetool tablehistograms <keyspace> <table> will give you the distribution of the cells and sizes of thee partition for the table. But that does not give you for sure that partition. To get the specific one you must use count(*) on a select query that specifies the partition key in where clause. A very large partition and that can fail though.
sstablemetadata after 4.0 is based off the describe command in sstable-tools. It will give you the partitions largest in size, and largest in number of rows, and the partitions with most tombstones if you provide the -s to scan the sstable. These can be used against 3.0 and 3.11 sstables. I think 2.1 sstables are not able to be processed though.
...
Partitions: 22515
Rows: 13579337
Tombstones: 0
Cells: 13579337
Widest Partitions:
[12345] 999999
[99049] 62664
[99007] 60437
[99017] 59728
[99010] 59555
Largest Partitions:
[12345] 189888705
[99049] 2965017
[99007] 2860391
[99017] 2826094
[99010] 2818038
...
above example has parititon key an int, it will print out key like:
Widest Partitions:
[frodo] 1
Largest Partitions:
[frodo] 104
You can find the total number of partitions available for a table with nodetool command. ./nodetool cfstats <keyspace>.<table>.
If you know the partition key, you can fire a select count(*) for the partition to get no. of the records in that partition. It's possible that query can timeout for count queries on big partitions set cqlsh request-timeout before executing the query.
To understand how to calculate the physical partition size, go through the Datastax DS220: Data Modeling partition size
Instaclustr has a tool to find the partition size. However, this does not show the number of records in each partition:
https://github.com/instaclustr/cassandra-sstable-tools
As mentioned above either use inbuilt node tool, which could be find within Cassandra Folder extracted from jar , and run nodetool inside terminal .
nodetool toppartitions
Additionally you can also use online tool such as : https://www.cqlguru.io/ , but this need some prior information as vaerage number of rows per partition, average number of text in varchar and all . But this tool is good for rough estimation.

Resources