Does sstableloader insert pairs, replicated over different sstables, uniquely? - cassandra

I used sstableloader to import snapshots from a cluster of 4 nodes configured to replicate four times. The folder structure of the snapshots is:
<keyspace>/<tablename>/snapshots/<timestamp>
Ultimately there were 4 timestamps in each snapshots folder, one for each node. They ended up in the same snapshots directory because I tar-gzipped the snapshots of all nodes and extracted them into the same directory.
I noticed that sstableloader couldn't handle this, because the tool assumes the directory path ends with <keyspace>/<tablename>/. Hence I restructured the folders to
<timestamp>/<keyspace>/<tablename>
And then I applied sstableloader to each timestamp:
sstableloader -d localhost <keyspace>/<tablename>
Restructuring the folders like this seems hacky, I agree, but I couldn't get the sstableloader tool to work otherwise. If there is a better way, please let me know.
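For reference, the per-timestamp loading boiled down to a loop of roughly this form (with my actual keyspace and table names in place of the placeholders):
$ for ts in 142961465*; do sstableloader -d localhost "$ts"/<keyspace>/<tablename>; done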
However, this worked:
Established connection to initial hosts
Opening sstables and calculating sections to stream
Streaming relevant part of <keyspace>/<tablename>/<keyspace>-<tablename>-ka-953-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-911-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-952-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-955-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-951-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-798-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-954-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-942-Data.db to [/127.0.0.1]
progress: [/127.0.0.1]0:8/8 100% total: 100% 0 MB/s(avg: 7 MB/s)
Summary statistics:
Connections per host: : 1
Total files transferred: : 8
Total bytes transferred: : 444087547
Total duration (ms): : 59505
Average transfer rate (MB/s): : 7
Peak transfer rate (MB/s): : 22
So I repeated the command for each timestamp (and each keyspace and each tablename), and all the data got imported on the single-node setup of my laptop (default after installing cassandra on ubuntu from ppa).
Possibly important to note: before importing with sstableloader I initialized the keyspace with replication factor 1, instead of the 3 used on the 4-node cluster:
CREATE KEYSPACE <keyspace> WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Nevertheless, I noticed this:
$ du -sh /var/lib/cassandra/data/<keyspace>/<tablename>-e08e2540e82a11e4a64d8d887149c575/
6,4G /var/lib/cassandra/data/<keyspace>/<tablename>-e08e2540e82a11e4a64d8d887149c575/
However, when I query the size of the snapshots:
$ du -sh 142961465*/<keyspace>/<tablename>
2,9G 1429614655449/<keyspace>/<tablename>
3,1G 1429614656562/<keyspace>/<tablename>
2,9G 1429614656676/<keyspace>/<tablename>
2,7G 1429614656814/<keyspace>/<tablename>
The snapshots have a total size of 11.6 GB; with replication factor 3 the essential part of the data should be ~3.9 GB. However, the /var/lib/cassandra/data/<keyspace>/<tablename>-e08e2540e82a11e4a64d8d887149c575/ folder is significantly larger. Why is this the case? How smart are Cassandra / sstableloader? Are redundant pairs filtered out somehow?

You're almost certainly seeing Cassandra doing the right thing: It's importing each sstable, and letting timestamp resolution win.
It's probably the case that your various sstables had various older versions of data: older sstables had obsolete, shadowed cells, and newer sstables had new, live cells. As sstableloader pushes that data into the cluster, the oldest data is written first and then obsoleted by the newer data as it's replayed. If there are deletes, then there will also be tombstones, which actually ADD space usage on top of everything else.
If you need to purge that obsolete data, you can run compaction (either using nodetool compact if that's an option for you - your data set is small enough it's probably fine - or something like http://www.encql.com/purge-cassandra-tombstones/ to do a single sstable at a time, if you're space constrained).
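For example (keyspace and table names are placeholders), a major compaction of just the imported table would look like:
nodetool compact <keyspace> <tablename>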

We were having a similar issue:
nodetool cleanup
nodetool compact keyspace1 table1 (Note: manual compaction is not recommended as per the Cassandra documentation; we did this as part of a migration)
We also found that sstableloader was creating very large files, so we used sstablesplit to break the table down into smaller files:
https://cassandra.apache.org/doc/latest/cassandra/tools/sstable/sstablesplit.html
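A rough sketch of such a run (Cassandra must be stopped before running sstablesplit; -s is the target SSTable size in MB, and the data path is a placeholder):
$ sudo service cassandra stop
$ sstablesplit -s 50 /var/lib/cassandra/data/keyspace1/table1-*/*-Data.db
$ sudo service cassandra start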

Related

Offline compaction/merging of multiple SSTables into one

$ cd /tmp
$ cp -r /var/lib/cassandra/data/keyspace/table-6e9e81a0808811e9ace14f79cedcfbc4 .
$ nodetool compact --user-defined table-6e9e81a0808811e9ace14f79cedcfbc4/*-Data.db
I expected the two SSTables (where the second one contains only tombstones) to be merged into one, which would be equivalent to the first one minus data masked by tombstones from the second one.
However, the last command returns a 0 exit status and nothing changes in the table-6e9e81a0808811e9ace14f79cedcfbc4 directory (both SSTables are still there). Any ideas how to unconditionally merge potentially multiple SSTables into one in an offline manner (like above, not on SSTable files currently used by the running cluster)?
Just nodetool compact <keyspace> <table>. There is no real offline compaction, only telling Cassandra which sstables to compact. User-defined compaction just gives it a custom list of sstables, and a major compaction (the example above) will include all sstables in a table.
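For example, with the table from the question, a major compaction on the live node would be:
nodetool compact keyspace table
and user-defined compaction only applies to sstables the node is actually serving from its own data directory, e.g.:
nodetool compact --user-defined /var/lib/cassandra/data/keyspace/table-6e9e81a0808811e9ace14f79cedcfbc4/*-Data.db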
Whether it will work really depends on which version you're using, but there is https://github.com/tolbertam/sstable-tools#compact available. If desperate, you can import cassandra-all for your version and do something like this: https://github.com/tolbertam/sstable-tools/blob/master/src/main/java/com/csforge/sstable/Compact.java

Cleanup space in almost full Cassandra Node

I have a Cassandra cluster (2 DCs) with 6 nodes each and RF 2. 4 of the nodes (in each DC) are getting full, so I need to clean up space very soon.
I tried to run a full repair, but it ended up being a bad idea since the space started increasing even more and the repair eventually hung. As a last solution I am thinking of running repair and then cleanup on specific column families, starting from the smallest to the biggest, i.e.:
nodetool repair -full foo_keyspace bar_columnfamily
nodetool cleanup foo_keyspace bar_columnfamily
Do you think that this procedure will be safe for the data?
Thank you
The commands that you presented in your question make several incorrect assumptions. First, "repair" is not supposed to, and will not, save any space. All repair does is find inconsistencies between different replicas and repair them. It will either do nothing (if there are no inconsistencies) or add data, not remove data.
Second, "cleanup" is something you need to do after adding new nodes to the cluster - after each node sends some of its data to the new node, a "cleanup" removes that data from the old nodes. But cleanup is not relevant when you are not adding nodes.
The command you may be looking for is "compact". This can save space, but only when you know you had a lot of overwrites (rewriting existing rows), deletions or data expirations (TTL). What compaction strategy are you using? If it's the default, size-tiered compaction strategy (STCS), you can start a major compaction (nodetool compact), but you should be aware of a big risk involved:
Major compaction merges all the data into one sstable (Cassandra's on-disk file format), dropping deleted, expired or overwritten data. However, during this compaction process you have both input and output files, and in the worst case this may double your disk usage, and it may fail if the disk is more than 50% full. This is why a lot of Cassandra best-practice guides suggest never filling more than 50% of the disk. But this is just the worst case. You can get along with less free space if you know that the output file will be much smaller than the input (because most of the data has been deleted). Perhaps more usefully, if you have many separate tables (column families), you can compact each one separately (as you suggested, from smallest to biggest), and the maximum amount of disk space needed temporarily during the compaction can be much less than 50% of the disk.
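If you are not sure which compaction strategy the table uses, one quick way to check (using the keyspace and table names from your commands) is to look at the table definition in cqlsh:
$ cqlsh -e "DESCRIBE TABLE foo_keyspace.bar_columnfamily;"
The compaction settings in the output show the strategy class, e.g. SizeTieredCompactionStrategy.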
Scylla, a C++ reimplementation of Cassandra, is developing something known as "hybrid compaction" (see https://www.slideshare.net/ScyllaDB/scylla-summit-2017-how-to-ruin-your-performance-by-choosing-the-wrong-compaction-strategy) which is like Cassandra's size-tiered compaction but does compaction in small pieces instead of generating one huge file, to avoid the huge temporary disk usage during compaction. Unfortunately, Cassandra doesn't have this feature yet.
A good idea is to first start repair on the smallest table in the smallest keyspace, one by one, and complete the repair. It will take time, but it is the safer way, with no chance of hanging or traffic loss.
Once repair has completed, start cleanup in the same way, table by table. This way there is no impact on the node or the cluster.
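For example, per-table on-disk sizes can be listed with nodetool cfstats (nodetool tablestats on newer versions), and then each table repaired and cleaned in turn, smallest first (the table name below is a placeholder):
nodetool cfstats
nodetool repair -full foo_keyspace smallest_table
nodetool cleanup foo_keyspace smallest_table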
You shouldn't fill more than about 50-60% of your disks, to leave room for compaction. If you're above that amount of disk usage, you need to consider getting bigger disks or adding more nodes.
Datastax recommendations are usually good to follow: https://docs.datastax.com/en/dse-planning/doc/planning/planPlanningDiskCapacity.html

Multiple version of db files in Cassandra data folder

I have been running code that reads from and writes to Cassandra column families. I have observed that my table size is around 10 GB, but the space on disk consumed by db files for the same table is around 400 GB, spread across different versions of files:
la-2749-big-Statistics.db la-2750-big-Index.db la-2750-big-Filter.db
la-2750-big-Summary.db la-2750-big-Data.db la-2750-big-Digest.adler32
la-2750-big-CRC.db la-2750-big-TOC.txt la-2750-big-Statistics.db
la-2751-big-Filter.db la-2751-big-Index.db la-2751-big-Summary.db
la-2751-big-Data.db la-2751-big-Digest.adler32 la-2751-big-CRC.db
la-2751-big-Statistics.db la-2751-big-TOC.txt
la-2752-big-Index.db la-2752-big-Filter.db la-2752-big-Summary.db
la-2752-big-Data.db la-2752-big-Digest.adler32 la-2752-big-CRC.db
la-2752-big-TOC.txt la-2752-big-Statistics.db
I would like to understand whether the latest version of the file set has all the data required, and whether I can remove the older versions. Does Cassandra provide a facility for rolling deletion of such files?
The number you refer to is the generation of the SSTable. Specifically, the format of the filename is:
Version-Generation-SSTableFormat-ComponentFile
In your case:
Version = la (the SSTable format version; since Cassandra 2.2 the table name is part of the directory name rather than the filename)
Generation = 275x
SSTableFormat = BIG
ComponentFile = Data.db, TOC.txt, etc.
You can't really tell whether the last SSTable contains all the data you need. The space on disk consumed by old generations can be released only if the data in them is not referenced anymore (snapshots come to mind) and their tombstones are older than gc_grace_seconds.
You should first check whether you have any snapshots, and if so use nodetool to remove the ones you no longer need. Then you should investigate how your tombstones are distributed among these SSTables; if they cannot get compacted away, you probably have a bigger problem to solve (e.g. schema redesign, or data migration to a new cluster).
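For example, snapshots can be listed and, if no longer needed, removed with nodetool (the snapshot tag and keyspace name are placeholders):
nodetool listsnapshots
nodetool clearsnapshot -t <snapshot_name> <keyspace>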

Cassandra cfstats: differences between Live and Total used space values

For about a month I have been seeing the following values of used space for 3 of the nodes (I have replication factor = 3) in my Cassandra cluster in the nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can note a 25% difference (~254 GB) between Live and Total space. It seems I have a lot of garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note that the Total value has stayed like this for a month on all three of these nodes. I was relying on Cassandra to normalize the data automatically.
What I have tried in order to decrease the used space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to clean up the garbage and free space?
OK, I have a solution. It looks like a Cassandra issue.
First, I went deep into the Cassandra 1.1.9 sources and noted that Cassandra performs some re-analysis of SSTables during node start. It removes the SSTables marked as compacted, recalculates the used space, and does some other stuff.
So, what I did was restart the 3 problem nodes. The Total and Live values became equal immediately after the restart completed, then the compaction process started, and the used space is now decreasing.
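For reference, a graceful way to restart each node (flushing memtables before the process stops) is something along these lines, depending on how the service is managed:
nodetool drain
sudo service cassandra restart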
Leveled compaction creates sstables of a fixed, relatively small size, in your case 100 MB, that are grouped into "levels". Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
So basically, from this statement in the Cassandra docs, we can conclude that maybe in your case the ten-times-larger level has not formed yet, resulting in no compaction.
Coming to the second point: since you have kept the replication factor at 3, the data has 3 duplicate copies, which is part of why you see this anomaly.
And finally, the 25% difference between Live and Total space is, as you know, due to delete operations.
For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100MB is going to cause you a lot of needless disk IO and it will cause it to take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
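A hedged sketch of changing that setting, assuming the same cassandra-cli syntax as the create statement above:
update column family BinaryData with compaction_strategy_options={sstable_size_in_mb: 15};
Existing sstables only pick up the new size as they are compacted again.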
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2. Especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 does still have some kinks being ironed out, so you will want to make sure to keep up to date with installing new versions, or even running off of the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.

Are SSTables or HFiles merged above 1TB?

In a major compaction, all SSTables from a region server (HBase) or all SSTables from a tablet server (Cassandra) are merged into one big one.
When that time comes, are many SSTables (with total space above 1TB) merged into one?
Maybe there are some range bounds for an SSTable or HFile that split it into several parts, to ensure that merge operations don't "rewrite the whole server"?
My question is related to the "Compaction" section of this link: http://wiki.apache.org/cassandra/MemtableSSTable
From what I found, the SSTable produced by a major compaction is actually not split in Cassandra. Other LSM-tree databases rely in this case on a distributed file system which splits the SSTable (or HFile, or CellStore in Hypertable) into several files (for example 64 MB each), but a major compaction still has to compact all of these files into one new SSTable (which I think is inefficient).
There are tickets in JIRA to improve and redesign compaction for Cassandra, as mentioned here:
https://issues.apache.org/jira/browse/CASSANDRA-1608
You may also want to read my second, similar question:
How much data per node in Cassandra cluster?

Resources