How do I find the size of a table in Cassandra Keyspaces? - cassandra

I used this command to get the size of a table:
nodetool cfstats -- <Keyspace>.<table name>
But I am not sure whether it is correct, because when I load more rows into my table, "Space Used" does not change; only "Memtable data size" changes.
I just want to know how to find the size of a table in a Cassandra keyspace.

When a node receives a mutation (a write), it is first persisted to the commit log and then written to a memtable. The mutations/writes are not persisted to an SSTable until the memtable is flushed to disk.
If you want to force the memtables to be flushed to disk, run:
$ nodetool flush -- ks_name table_name
For more info, see How data is written in Cassandra. Cheers!
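To illustrate why "Space Used" stays flat while "Memtable data size" grows, here is a minimal sketch of the write path described above. It is a toy model, not Cassandra's implementation: the class ToyTable, its methods, and the byte-counting rule are all made up for illustration.

```python
class ToyTable:
    """Toy model of one table on one node (hypothetical, not Cassandra code)."""

    def __init__(self):
        self.commit_log = []  # append-only record of every mutation
        self.memtable = {}    # in-memory data, key -> value
        self.sstables = []    # immutable "on-disk" tables

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. persist to the commit log
        self.memtable[key] = value            # 2. then apply to the memtable

    def flush(self):
        """Roughly what a flush does: the memtable becomes a new SSTable."""
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def memtable_data_size(self):
        # Assumed size rule for the sketch: key length + value length.
        return sum(len(k) + len(v) for k, v in self.memtable.items())

    def space_used(self):
        return sum(len(k) + len(v) for t in self.sstables for k, v in t.items())


table = ToyTable()
table.write("k1", "hello")
table.write("k2", "world")
before = (table.memtable_data_size(), table.space_used())
table.flush()
after = (table.memtable_data_size(), table.space_used())
print(before, after)
```

Before the flush, all of the data counts toward the memtable size and none toward the on-disk size; after the flush it is the other way around.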

Related

nodetool tablehistograms command returned "No SSTables exists, unable to calculate 'Partition Size' and 'Cell Count' percentiles"

The command I ran was:
nodetool tablehistograms <keyspace> <table>
The error was:
No SSTables exists, unable to calculate 'Partition Size' and 'Cell Count' percentiles
I was trying to find the partition sizes so I could choose better partition keys, but the nodetool command failed with this error and reported no partition sizes.
As far as I know, SSTables are immutable, and I do not know whether I should create SSTables based on existing ones, or how to do so.
Any help would be really appreciated.
Best
How exact do you need to be when measuring the partition sizes?
For a quick estimate, 'nodetool tablestats <keyspace.table>' will give you the min, max and avg partition size.
If a more accurate measurement is needed, you could download and use DSBulk and run the count option to pull the largest n partitions for a table, which will also print the key, for example:
dsbulk count --stats.modes partitions --stats.numPartitions <n> -k myKeyspace -t myTable
There are no histograms available for the command to report if there are no SSTables on disk.
The nodetool tablehistograms command collects the metrics from the SSTables but if there are none stored on disk then there is nothing for the command to report.
Make sure that the table contains data in the data/ directory then try again. Cheers!

Cassandra read path

I'm learning Cassandra's read path.
According to some sources:
"When Cassandra receives the read request, data will be searched first in the Memtable, then data will be searched in SSTables and if data exists it is returned"
Also, I know that Memtables are periodically flushed to SSTables on disk.
My questions:
1. Are memtables fully deleted from RAM after flushing to SSTables?
2. Suppose we have a read request on a node that contains both memtables and SSTables. Is it possible for Cassandra to get the required data only from memtables without accessing SSTables? If yes, when is that possible, and how can Cassandra determine that the required data is stored only in memtables and that no related data is stored on disk (in SSTables)?
Short answer to the 2nd question: no. Cassandra will always check SSTables even if the data is in the memtable, because the data in the memtable could be older than the data in an SSTable. This can happen, for example, if you explicitly set the write timestamp for records, or if data is replayed from hints stored on another node. As for the 1st question: when a memtable is flushed, its data is removed from memory. In some cases you can use the row cache if you have data that is accessed often.
You can read more about read path in the DSE arch guide.
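The reasoning in the answer above can be sketched as a timestamp-based merge. This is a simplified illustration, not Cassandra's read path: the function read, the (timestamp, value) tuples, and the sample keys are hypothetical, and the sketch omits everything a real node does to avoid unnecessary disk reads.

```python
def read(key, memtable, sstables):
    """Return the value with the highest write timestamp across all sources.

    Each source maps key -> (write_timestamp, value). The memtable is not
    assumed to be newest, so every source has to be consulted.
    """
    candidates = []
    for source in [memtable, *sstables]:
        if key in source:
            candidates.append(source[key])
    if not candidates:
        return None
    # Last write wins: pick the cell with the largest timestamp.
    return max(candidates, key=lambda cell: cell[0])[1]


# The SSTable holds a write with timestamp 200. A later-arriving write with an
# explicitly set, older timestamp (100) is still sitting in the memtable.
sstable = {"user:1": (200, "new-email@example.com")}
memtable = {"user:1": (100, "old-email@example.com")}

result = read("user:1", memtable, [sstable])
print(result)
```

If the read stopped at the memtable, it would return the older value, which is why finding the key in memory is not enough.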

Regarding Cassandra Table Size

How to calculate the total size of a keyspace in cassandra?
I have tried the nodetool cfstats and nodetool tablestats command. It is giving a lot of information, but I am not sure which field provides the exact information.
Can anybody suggest any method to find out the size of a keyspace and a table in Cassandra?
"nodetool tablestats" replaces the older command "nodetool cfstats". In other words both are the same. Output of this command lists the size of each of the tables within a keyspace.
Amongst the output, you are looking for the "Space used (total)" value. It is the total number of bytes of disk space used by SSTables belonging to this table, including obsolete SSTables waiting to be garbage collected.
Since there could be multiple tables within a keyspace, you need to sum up "Space used (total)" for all tables belonging to a keyspace to get size occupied by keyspace.
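As a rough sketch of that summation, the snippet below parses tablestats-style text and adds up "Space used (total)" per keyspace. The SAMPLE text is a hand-written, shortened approximation of the output (the real layout varies between Cassandra versions), the keyspace and table names are made up, and the regular expressions assume raw byte counts, i.e. output produced without the -H (human-readable) option.

```python
import re
from collections import defaultdict

# Hand-written sample for illustration only; real output has many more fields.
SAMPLE = """\
Keyspace : shop
    Read Count: 10
        Table: orders
        Space used (live): 4096
        Space used (total): 5120
        Space used by snapshots (total): 0
        Table: customers
        Space used (live): 2048
        Space used (total): 2048
"""

KEYSPACE_RE = re.compile(r"^\s*Keyspace\s*:\s*(\S+)")
SPACE_TOTAL_RE = re.compile(r"^\s*Space used \(total\):\s*(\d+)")


def keyspace_sizes(tablestats_text):
    """Sum 'Space used (total)' over all tables, grouped by keyspace."""
    totals = defaultdict(int)
    keyspace = None
    for line in tablestats_text.splitlines():
        match = KEYSPACE_RE.match(line)
        if match:
            keyspace = match.group(1)
            continue
        match = SPACE_TOTAL_RE.match(line)
        if match and keyspace is not None:
            totals[keyspace] += int(match.group(1))
    return dict(totals)


sizes = keyspace_sizes(SAMPLE)
print(sizes)
```

In practice the text would come from running nodetool tablestats on a node, and the result describes only that node.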
Alternatively, if you have SSH access to the nodes, go to the Cassandra data directory and run "du -h" to get the size of each keyspace directory. Again, sum up the directory sizes on all nodes for that keyspace (ignoring the snapshot sizes).
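The directory-based approach can be sketched in Python as well. This is only an approximation of "du": it sums apparent file sizes rather than disk blocks, and the directory layout and file names built in the demo are assumptions made for the example, not taken from a real node.

```python
import os
import tempfile


def keyspace_dir_size(keyspace_dir):
    """Sum file sizes under a keyspace directory, skipping 'snapshots' dirs."""
    total = 0
    for root, dirs, files in os.walk(keyspace_dir):
        # Prune snapshot directories in place so os.walk never descends into them.
        dirs[:] = [d for d in dirs if d != "snapshots"]
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Demo on a throwaway directory tree (hypothetical names and sizes).
with tempfile.TemporaryDirectory() as data_dir:
    table_dir = os.path.join(data_dir, "shop", "orders-abc123")
    snap_dir = os.path.join(table_dir, "snapshots", "snap1")
    os.makedirs(snap_dir)
    with open(os.path.join(table_dir, "data-file.db"), "wb") as f:
        f.write(b"x" * 100)  # live data: counted
    with open(os.path.join(snap_dir, "data-file.db"), "wb") as f:
        f.write(b"x" * 50)   # snapshot copy: ignored
    size = keyspace_dir_size(os.path.join(data_dir, "shop"))

print(size)
```

As with "du", the function would have to be run on every node and the results added together.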

How to compact sstables offline?

I am using CQLSSTableWriter to write sstables in an offline/bulk mode. The order is not enforced during the write operation. Is it possible to enforce a compaction before I use sstableloader to load data into cassandra cluster?
SSTables are immutable by nature. Also, an SSTable is not just a single file: it consists of the data plus metadata.
The metadata includes index.db etc.; check the DataStax docs for more details.
So you should not compact them manually, because the token range in each SSTable changes during compaction and the resulting SSTable will not have evenly distributed data.
Compaction also leads to larger SSTables, and the node that holds such an SSTable will become a hotspot.
It is better, and recommended, not to do it manually.
You can drain the node via nodetool drain and then safely continue your compactions.

What is the purpose of Cassandra's commit log?

Could someone please help me understand the commit log and its use?
In Cassandra, when writing to disk, is the commit log the first entry point, or are the MemTables?
If MemTables are what gets flushed to disk, what is the use of the commit log? Is its only purpose to handle sync issues if a data node is down?
You can think of the commit log as an optimization, but Cassandra would be unusably slow without it. When MemTables get written to disk we call them SSTables. SSTables are immutable, meaning once Cassandra writes them to disk it does not update them. So when a column changes Cassandra needs to write a new SSTable to disk. If Cassandra was writing these SSTables to disk on every update it would be completely IO bound and very slow.
So Cassandra uses a few tricks to get better performance. Instead of writing SSTables to disk on every column update, it keeps the updates in memory and flushes those changes to disk periodically to keep the IO to a reasonable level. But this leads to the obvious problem that if the machine goes down or Cassandra crashes you would lose data on that node. To avoid losing data, in addition to keeping recent changes in memory, Cassandra writes the changes to its CommitLog.
You may be asking why is writing to the CommitLog any better than just writing the SSTables. The CommitLog is optimized for writing. Unlike SSTables which store rows in sorted order, the CommitLog stores updates in the order which they were processed by Cassandra. The CommitLog also stores changes for all the column families in a single file so the disk doesn't need to do a bunch of seeks when it is receiving updates for multiple column families at the same time.
Basically, writing the CommitLog to disk is better because it has to write less data than writing SSTables does, and it writes all that data to a single place on disk.
Cassandra keeps track of what data has been flushed to SSTables and is able to truncate the Commit log once all data older than a certain point has been written.
When Cassandra starts up it has to read the commit log back from that last known good point in time (the point at which we know all previous writes were written to an SSTable). It re-applies the changes in the commit log to its MemTables so it can get into the same state when it stopped. This process can be slow so if you are stopping a Cassandra node for maintenance it is a good idea to use nodetool drain before shutting it down which will flush everything in the MemTables to SSTables and make the amount of work on startup a lot smaller.
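The startup replay and the effect of a drain can be illustrated with a small toy model. Everything here is hypothetical (the ToyNode class, the disk dictionary standing in for persistent storage, the replayed counter); it only mirrors the behaviour described above, not Cassandra's actual code.

```python
class ToyNode:
    """Toy node: 'disk' survives restarts, the memtable does not."""

    def __init__(self, disk):
        self.disk = disk    # {"commit_log": [...], "sstables": [...]}
        self.memtable = {}  # rebuilt from the commit log on startup
        self.replayed = 0
        for key, value in disk["commit_log"]:
            self.memtable[key] = value  # re-apply each logged mutation
            self.replayed += 1

    def write(self, key, value):
        self.disk["commit_log"].append((key, value))
        self.memtable[key] = value

    def drain(self):
        """Flush the memtable to an SSTable, then discard the covered log."""
        if self.memtable:
            self.disk["sstables"].append(dict(self.memtable))
            self.memtable = {}
        self.disk["commit_log"].clear()


disk = {"commit_log": [], "sstables": []}
node = ToyNode(disk)
for i in range(3):
    node.write(f"k{i}", f"v{i}")

# Restart without draining: every logged write has to be replayed.
restarted = ToyNode(disk)
replayed_without_drain = restarted.replayed

# Drain first, then restart: nothing is left to replay.
restarted.drain()
clean = ToyNode(disk)
replayed_after_drain = clean.replayed

print(replayed_without_drain, replayed_after_drain)
```

The drained restart has no replay work because the data is already in an SSTable and the log entries covering it have been discarded.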
The write path in Cassandra works like this:
Cassandra Node ---->Commitlog-----------------> Memtable
                        |                          |
                        |                          |
                        |---> Periodically         |---> Periodically
                              sync to disk               flush to SSTable
Memtable and Commitlog are NOT written (kind of) in parallel. Write to Commitlog must be finished before starting to write to Memtable. Related source code stack is:
org.apache.cassandra.service.StorageProxy.mutateMV:mutation.apply->
org.apache.cassandra.db.Mutation.apply:Keyspace.open(keyspaceName).apply->
org.apache.cassandra.db.Keyspace.apply->
org.apache.cassandra.db.Keyspace.applyInternal{
    Tracing.trace("Appending to commitlog");
    commitLogPosition = CommitLog.instance.add(mutation)
    ...
    Tracing.trace("Adding to {} memtable",...
    ...
    upd.metadata().name(...);
    ...
    cfs.apply(...);
    ...
}
The purpose of the Commitlog is to be able to recreate the Memtable after a node crashes or gets rebooted. This is important, since the Memtable only gets flushed to disk when it is 'full' (meaning the configured Memtable size is exceeded) or when a flush is triggered by nodetool or OpsCenter. So the data in the Memtable is not persisted directly.
Having said that, a good thing before rebooting a node or container is to call nodetool flush to make sure your Memtables are fully persisted (flushed) to SSTables on disk. This also will reduce playback time of the Commitlog after the node or container comes up again.
