Cassandra nodetool listsnapshots output is unclear

What is true size in the output of the command nodetool listsnapshots?
There is no explanation in the Cassandra documentation.

It's the total size of the SSTables that only that snapshot still has a hard link to.
Snapshots just create hard links to the actual SSTable components. Once an SSTable has been compacted and deleted away, the hard link in the snapshot may be the only link still referencing the inode and preventing it from being freed. That is what true size measures.
For example, if you disable compaction and take a snapshot, listsnapshots will immediately show a true size of zero. If you then turn the node off, delete one SSTable in the data directory and restart, listsnapshots will show the true size as the size of the deleted SSTable.
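You can see the hard-link behaviour directly on disk; a minimal sketch, assuming a keyspace "mykeyspace", a table "mytable" and a 3.x-style SSTable file name (all placeholders):
# taking a snapshot only creates hard links under the table's snapshots/ directory
nodetool snapshot -t mysnap mykeyspace
# the live data file and the snapshot copy share the same inode (first column of ls -li)
ls -li /var/lib/cassandra/data/mykeyspace/mytable-*/mc-1-big-Data.db
ls -li /var/lib/cassandra/data/mykeyspace/mytable-*/snapshots/mysnap/mc-1-big-Data.db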

I looked for a while and couldn't find anything in the Cassandra docs - however, the ScyllaDB docs (Scylla itself being derived from Cassandra) say true size is "Total size of all SSTables which are not backed up to disk".
Further reading offers the following example:
There is a single 1TB file in the snapshot directory. If that file also exists in the main column family directory, the size on the disk is 1TB and the true size is 0 because it is already backed up to disk.
It seems "true size" is the amount of data that has not yet been backed up - if your backups are fresh, it will be 0.

Related

Disk space of Cassandra node is over 80%

I'm running 12 Cassandra nodes on AWS EC2 instances. Four of them are using almost 80% of their disk space, so compaction fails on those nodes. Since they are EC2 instances, I can't add more disk space to the existing data volume on the fly, and I can't ask the IT team to add more nodes to scale out and spread the cluster, because disk usage on the other nodes is below 40%. Before fixing the unbalanced-cluster issue, is there any way to free up some disk space?
My question is: how can I find unused SSTables and move them to another partition, so that I can run compaction and free up some space?
Any other suggestions to free up disk space are welcome.
PS: I already dropped all the snapshots and backups.
If you are using vnodes, the difference in data size between nodes should not be that large. Before jumping to a solution, we must find the reason for the big difference in data sizes on different nodes.
Look into the logs to see whether corruption of some big SSTable resulted in compaction failures and an increase in data size, or whether anything else in the logs points to the reason the disk usage is growing.
We faced an issue in Cassandra 2.1.16 where, due to a bug, old SSTable files were not removed even after compaction. We read the logs and identified the files that could be removed. This is an example where we found the reason for the increased data size by reading the logs.
So you must identify the reason before applying a solution. If it is a dire situation, you can identify keyspaces/tables which are not used during your main traffic, move their SSTables to a backup location and remove them from the data directory. Once your compaction process is over, you can bring them back.
Warning: test any procedure before trying it on production.
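A minimal sketch of the move-aside approach, assuming the node can be taken down and that /mnt/backup is a partition with free space (all paths, keyspace and table names are placeholders; snapshots and backups have already been cleared in this scenario):
# flush and stop the node cleanly before touching SSTables on disk
nodetool drain
sudo service cassandra stop
# move the SSTables of a table that is not needed during peak traffic to another partition
mkdir -p /mnt/backup/myks/mytable
mv /var/lib/cassandra/data/myks/mytable-*/* /mnt/backup/myks/mytable/
# restart, let compaction finish on the remaining tables, then move the files back
sudo service cassandra start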

Cassandra - how to disable memtable flush

I'm running Cassandra with a very small dataset so that the data can exist on memtable only. Below are my configurations:
In jvm.options:
-Xms4G
-Xmx4G
In cassandra.yaml,
memtable_cleanup_threshold: 0.50
memtable_allocation_type: heap_buffers
As per the documentation in cassandra.yaml, memtable_heap_space_in_mb and memtable_offheap_space_in_mb will each be set to 1/4 of the heap size, i.e. 1000MB.
According to the documentation here (http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__memtable_cleanup_threshold), a memtable flush will trigger if the total size of the memtable(s) goes beyond (1000+1000)*0.50 = 1000MB.
Now when I perform several write requests that result in almost ~300MB of data, a memtable still gets flushed, since I see SSTables (Data.db etc.) being created on the file system, and I don't understand why.
Could anyone explain this behavior and point out if I'm missing something here?
One additional trigger for memtable flushing is the amount of commit log space used (commitlog_total_space_in_mb, which defaults to 32MB on 32-bit JVMs and 8192MB on 64-bit JVMs).
http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsMemtableThruput.html
http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__commitlog_total_space_in_mb
Since Cassandra is meant to be durable, it has to write to disk so it can recover the data after a node failure. If you don't need that durability, you can use a memory-based store such as Redis or Memcached instead.
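For reference, these are the cassandra.yaml settings involved; the values below are only illustrative, not recommendations:
memtable_heap_space_in_mb: 1024        # defaults to 1/4 of the heap if unset
memtable_offheap_space_in_mb: 1024     # defaults to 1/4 of the heap if unset
memtable_cleanup_threshold: 0.50
memtable_allocation_type: heap_buffers
commitlog_total_space_in_mb: 8192      # a full commit log also forces memtable flushes
commitlog_segment_size_in_mb: 32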
Below is the response I got from the Cassandra user group; I'm copying it here in case someone else is looking for similar info.
After thinking about your scenario I believe your small SSTable size might be due to data compression. By default, all tables enable SSTable compression.
Let's go through your scenario. Say you have allocated 4GB to your Cassandra node. Your memtable_heap_space_in_mb and memtable_offheap_space_in_mb will each come to roughly 1GB. Since you have memtable_cleanup_threshold at 0.50, memtable cleanup will be triggered when the total allocated memtable space exceeds 1/2GB. Note the cleanup threshold is 0.50 of 1GB, not of the combined heap and off-heap space. This memtable allocation is the total amount available for all tables on your node, including all system-related keyspaces. The cleanup process will write the largest memtable to disk.
For your case, I am assuming that you are on a single node with only one table receiving insert activity. I do not think the commit log will trigger a flush in this circumstance, because by default the commit log has 8192 MB of space, unless it is placed on a very small disk.
I am assuming your table on disk is smaller than 500MB because of compression. You can disable compression on your table and see if this helps get the desired size.
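For example, compression could be disabled per table with CQL (keyspace and table names are placeholders; on 2.x the equivalent option is {'sstable_compression': ''}):
ALTER TABLE mykeyspace.mytable WITH compression = {'enabled': false};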
I have written up a blog post explaining memtable flushing (http://abiasforaction.net/apache-cassandra-memtable-flush/)
Let me know if you have any other question.
I hope this helps.

Cassandra dropped keyspaces still on HDD

I noticed an increase in the number of open files on my Cassandra cluster and went to check its health. Nodetool status reported only 300GB in use per node of the 3TB each has allocated.
Shortly thereafter I began to see heap OOM errors showing up in the Cassandra logs.
These nodes had been running for 3-4 months with no issues, but had had a series of test data populated into and then dropped from them.
After checking the hard drives via the df command I was able to determine they were all between 90-100% full in a JBOD setup.
Edit: further investigation shows that the remaining files are in the 'snapshots' subfolder and the data subfolder itself has no db tables.
My question is, has anyone seen this? Why did compaction not free these tombstones? Is this a bug?
Snapshots aren't tombstones - they are a backup of your data.
As Highstead says, you can drop any unused snapshots via the clearsnapshot command.
You can disable the automatic snapshot facility via cassandra.yaml:
https://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html#reference_ds_qfg_n1r_1k__auto_snapshot
Also check whether you have snapshot_before_compaction set to a non-default value of true.
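The relevant cassandra.yaml entries look like this (auto_snapshot defaults to true; these values turn the automatic snapshots off):
auto_snapshot: false
snapshot_before_compaction: false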
Snapshots accumulate over the lifetime of the Cassandra cluster. They are not counted by nodetool status but still occupy space. In this case the snapshots consuming all the space were created when a table was dropped.
To retrieve a list of current snapshots, use the command nodetool listsnapshots.
This feature can be disabled by setting auto_snapshot to false in cassandra.yaml. Alternatively, these snapshots can be purged via the command nodetool clearsnapshot <name>.
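For example (the snapshot tag is a placeholder; older versions clear everything when no tag is given, while 4.0+ requires --all for that):
# list snapshots and the space they pin
nodetool listsnapshots
# remove a single snapshot by tag, or all of them
nodetool clearsnapshot -t mysnapshot
nodetool clearsnapshot --all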

Freeing disk space of overwritten data?

I have a table whose rows get overwritten frequently using the regular INSERT statements. This table holds ~50GB data, and the majority of it is overwritten daily.
However, according to OpsCenter, disk usage keeps going up and is not freed.
I have validated that rows are being overwritten and not simply being appended to the table. But they're apparently still taking up space on disk.
How can I free disk space?
Under the covers, what Cassandra does during these writes is append a new version of the row, with a newer timestamp, to a new SSTable. When you perform a read, the newest row (based on timestamp) is returned to you. However, this also means that you are using twice the disk space in the meantime. It is not until Cassandra runs a compaction operation that the older rows are removed and the disk space recovered. Here is some information on how Cassandra writes to disk, which explains the process:
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_write_path_c.html?scroll=concept_ds_wt3_32w_zj__dml-compaction
A compaction is done on a node-by-node basis and is a very disk-intensive operation which may affect the performance of your cluster while it is running. You can run a manual compaction using the nodetool compact command:
https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCompact.html
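For example, to compact a single table so the overwritten rows are merged and the old versions dropped (keyspace and table names are placeholders):
nodetool compact mykeyspace mytable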
As Aaron mentioned in his comment above overwriting all the data in your cluster daily is not really the best use case for Cassandra because of issues such as this one.

Does sstableloader insert pairs, replicated over different sstables, uniquely?

I used sstableloader to import snapshots from a cluster of 4 nodes configured to replicate four times. The folder structure of the snapshots is:
<keyspace>/<tablename>/snapshots/<timestamp>
Ultimately there were 4 timestamps in each snapshot folder, one for each node. They appeared in the same snapshot-directory, because I tar-gzipped them and extracted the snapshots of all nodes in the same directory.
I noticed that sstableloader couldn't handle this, because the tool assumes the path it is given ends in <keyspace>/<tablename>. Hence I restructured the folders to
<timestamp>/<keyspace>/<tablename>
And then I applied sstableloader to each timestamp:
sstableloader -d localhost <keyspace>/<tablename>
This seems hacky, and I agree that restructuring the folders is a workaround, but I couldn't get the sstableloader tool to work otherwise. If there is a better way, please let me know.
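For reference, a sketch of looping over the timestamps (keyspace and table names are placeholders; the timestamps are the four snapshot directories listed further down):
for ts in 1429614655449 1429614656562 1429614656676 1429614656814; do
    ( cd "$ts" && sstableloader -d localhost <keyspace>/<tablename> )
done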
However, this worked:
Established connection to initial hosts
Opening sstables and calculating sections to stream
Streaming relevant part of <keyspace>/<tablename>/<keyspace>-<tablename>-ka-953-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-911-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-952-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-955-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-951-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-798-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-954-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-942-Data.db to [/127.0.0.1]
progress: [/127.0.0.1]0:8/8 100% total: 100% 0 MB/s(avg: 7 MB/s)
Summary statistics:
Connections per host: : 1
Total files transferred: : 8
Total bytes transferred: : 444087547
Total duration (ms): : 59505
Average transfer rate (MB/s): : 7
Peak transfer rate (MB/s): : 22
So I repeated the command for each timestamp (and each keyspace and each tablename), and all the data got imported into the single-node setup on my laptop (the default setup after installing Cassandra on Ubuntu from the PPA).
Possibly important to note: before importing with sstableloader I initialized the keyspace with replication factor 1, instead of the 3 used on the 4-node cluster.
CREATE KEYSPACE <keyspace> WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Nevertheless, I noticed this:
$ du -sh /var/lib/cassandra/data/<keyspace>/<tablename>-e08e2540e82a11e4a64d8d887149c575/
6,4G /var/lib/cassandra/data/<keyspace>/<tablename>-e08e2540e82a11e4a64d8d887149c575/
However, when I query the size of the snapshots:
$ du -sh 142961465*/<keyspace>/<tablename>
2,9G 1429614655449/<keyspace>/<tablename>
3,1G 1429614656562/<keyspace>/<tablename>
2,9G 1429614656676/<keyspace>/<tablename>
2,7G 1429614656814/<keyspace>/<tablename>
The snapshots have a total size of 11.6GB; with replication factor 3, the essential part of the data should be ~3.9GB. However, the /var/lib/cassandra/data/<keyspace>/<tablename>-e08e2540e82a11e4a64d8d887149c575/ folder is significantly larger. Why is this the case? How smart are Cassandra and sstableloader? Are redundant duplicates filtered out somehow?
You're almost certainly seeing Cassandra doing the right thing: it's importing each SSTable and letting timestamp resolution win.
It's probably the case that your various SSTables held various older versions of the data: older SSTables had obsolete, shadowed cells, and newer SSTables had new, live cells. As sstableloader pushes that data into the cluster, the oldest data is written first and then obsoleted by the newer data as it's replayed. If there are deletes, there will also be tombstones, which actually ADD space usage on top of everything else.
If you need to purge that obsolete data, you can run compaction (either using nodetool compact if that's an option for you - your data set is small enough it's probably fine - or something like http://www.encql.com/purge-cassandra-tombstones/ to do a single sstable at a time, if you're space constrained).
We were having a similar issue:
nodetool cleanup
nodetool compact keyspace1 table1 (Note: manual compaction is not recommended per the Cassandra documentation; we did this as part of a migration)
We also found that sstableloader was creating very large files, so we used sstablesplit to break the table down into smaller files:
https://cassandra.apache.org/doc/latest/cassandra/tools/sstable/sstablesplit.html
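For example (the node should be stopped first; the path is a placeholder and -s is the target size in MB):
sstablesplit -s 50 /var/lib/cassandra/data/<keyspace>/<tablename>-*/*-Data.db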
