$ cd /tmp
$ cp -r /var/lib/cassandra/data/keyspace/table-6e9e81a0808811e9ace14f79cedcfbc4 .
$ nodetool compact --user-defined table-6e9e81a0808811e9ace14f79cedcfbc4/*-Data.db
I expected the two SSTables (where the second one contains only tombstones) to be merged into one, which would be equivalent to the first one minus data masked by tombstones from the second one.
However, the last command returns a 0 exit status and nothing changes in the table-6e9e81a0808811e9ace14f79cedcfbc4 directory (the two SSTables are still there). Any ideas how to unconditionally merge potentially multiple SSTables into one in an offline manner (as above, i.e. not on the SSTable files currently used by the running cluster)?
Just nodetool compact <keyspace> <table>. There is no real offline compaction, only telling Cassandra which sstables to compact. User-defined compaction just lets you give it a custom list of sstables, and a major compaction (the example above) will include all sstables in a table.
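For concreteness, a rough sketch of both variants; the keyspace/table names and the sstable filename are made up, and the --user-defined path has to point at an sstable in the node's live data directory, since Cassandra only compacts files it knows about (which is presumably why the copies under /tmp above were silently skipped):
$ # major compaction: all sstables of the table
$ nodetool compact keyspace table
$ # user-defined compaction: an explicit list of live -Data.db files (filename here is hypothetical)
$ nodetool compact --user-defined /var/lib/cassandra/data/keyspace/table-6e9e81a0808811e9ace14f79cedcfbc4/md-1-big-Data.db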
Whether it will work really depends on which version you're using, but there is https://github.com/tolbertam/sstable-tools#compact available. If you're desperate, you can import cassandra-all for your version and do something like this: https://github.com/tolbertam/sstable-tools/blob/master/src/main/java/com/csforge/sstable/Compact.java
I have Cassandra 3.11.4 and have been running a test environment for a while. I have done nodetool cleanup, clearsnapshot, repair, compact, etc., and what remains in the data storage directory for my keyspace is numerous "empty" directories.
When running du from the directory:
0 ./a/backups
47804 ./a
0 ./b/backups
0 ./b
0 ./c/backups
0 ./c
0 ./d/backups
0 ./d
7748832 .
This is just a portion of the data, with names replaced by generic letters, but essentially there are many of these empty directories remaining. The referenced tables, however, were dropped a long time ago, i.e. longer than gc_grace_seconds, yet the directories remain. These are not snapshots, as making a snapshot and clearing it with nodetool clearsnapshot works fine.
Before I manually delete each of the empty folders, which is going to be a pain as there are a lot of them: am I missing a step in maintaining my cluster that causes this, or is it something that just happens and has to be handled regularly when test schemas change often?
Snapshots do get cleared, and the trailing /backups directories suggest that these are incremental backups?
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsBackupIncremental.html
Even if that is the case, I can find no method to remove these incremental backups, at least with nodetool, and in any case the incremental_backups setting in cassandra.yaml is false.
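For what it's worth, there is also a runtime check; I would expect output along these lines when incremental backups are off:
$ nodetool statusbackup
not running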
I believe there are answers stating it is safe to delete these "ghost" directories, but that would be extremely annoying if the keyspace has many of them. Also, maybe it is just my wish for clean directories, but would these "ghost" directories have any impact on performance?
So the "ghost" table directories are either from:
1) empty table - still a valid table, but no data ever inserted
2) truncated tables
3) dropped tables
In the first and second cases, removing the directory could end up causing issues. If you want to validate whether a directory is in use for a table, you can query:
select id from system_schema.tables
where keyspace_name = 'xxxx' and
table_name = 'yyyy';
That id is the suffix used in the directory name for that table. Any other directories for that table in that keyspace are not in use.
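For example, with hypothetical names (the id value below is made up), the query returns a uuid whose dashes are removed to form the directory suffix:
select id from system_schema.tables where keyspace_name = 'my_ks' and table_name = 'a';

 id
--------------------------------------
 5a1c395e-b41f-11e5-9f22-ba0be0483c18

(1 rows)
So the live directory would be my_ks/a-5a1c395eb41f11e59f22ba0be0483c18, and any other a-<uuid> directory under my_ks is not in use.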
-Jim
When is it NOT necessary to truncate the table when restoring a snapshot (incremental) for Cassandra?
All the different documentation "providers", including the 2nd edition of Cassandra: The Definitive Guide, say something like this: "If necessary, truncate the table." If you restore without truncating (removing the tombstones), Cassandra continues to shadow the restored data. This behavior also occurs for other types of overwrites and causes the same problem.
If I have an insert only C* keyspace (no upserts and no deletes), do I ever need to truncate before restoring?
The documentation seems to imply that I can delete all of the sstable files from a column family (rm -f /data/.), copy the snapshot to /data/, and nodetool refresh.
Is this true?
You are right - you can restore a snapshot exactly this way. Copy over the sstables, restart the node, and you are done. With incremental backups, be sure you have all the sstables that contain your data.
What could happen if you have updates and deletes is that, after restoring a node or while restoring multiple nodes, stale data becomes visible, or you run into problems with tombstones when data was deleted after the snapshot was taken.
The magic of truncating tables is that all data is gone at once, so you avoid such problems.
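A minimal sketch of the copy-and-refresh route from the question, with hypothetical keyspace/table/snapshot names (nodetool refresh loads newly placed sstables without a restart):
$ cd /var/lib/cassandra/data/my_ks/my_table-<table-id>
$ cp snapshots/my_snapshot/* .
$ nodetool refresh my_ks my_table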
I used sstableloader to import snapshots from a cluster of 4 nodes configured to replicate four times. The folder structure of the snapshots is:
<keyspace>/<tablename>/snapshots/<timestamp>
Ultimately there were 4 timestamps in each snapshot folder, one for each node; they appeared in the same snapshot directory because I tar-gzipped the snapshots of all nodes and extracted them into the same directory.
I noticed that sstableloader couldn't handle this, because the tool assumes the path it is given ends with <keyspace>/<tablename>/. Hence I restructured the folders to
<timestamp>/<keyspace>/<tablename>
And then I applied sstableloader to each timestamp:
sstableloader -d localhost <keyspace>/<tablename>
This seems hacky, as I had to restructure the folders, and I agree, but I couldn't get the sstableloader tool to work otherwise. If there is a better way, please let me know.
However, this worked:
Established connection to initial hosts
Opening sstables and calculating sections to stream
Streaming relevant part of <keyspace>/<tablename>/<keyspace>-<tablename>-ka-953-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-911-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-952-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-955-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-951-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-798-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-954-Data.db <keyspace>/<tablename>/<keyspace>-<tablename>-ka-942-Data.db to [/127.0.0.1]
progress: [/127.0.0.1]0:8/8 100% total: 100% 0 MB/s(avg: 7 MB/s)
Summary statistics:
Connections per host: : 1
Total files transferred: : 8
Total bytes transferred: : 444087547
Total duration (ms): : 59505
Average transfer rate (MB/s): : 7
Peak transfer rate (MB/s): : 22
So I repeated the command for each timestamp (and each keyspace and each tablename), and all the data got imported into the single-node setup on my laptop (the default after installing Cassandra on Ubuntu from the PPA).
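For the record, the repetition is easy to script; a sketch assuming the restructured <timestamp>/<keyspace>/<tablename> layout from above:
$ for ts in */; do sstableloader -d localhost "${ts}<keyspace>/<tablename>"; done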
Possibly important to note: before importing with sstableloader, I initialized the keyspace with replication factor 1 instead of the 3 used on the 4-node cluster.
CREATE KEYSPACE <keyspace> WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Nevertheless, I noticed this:
$ du -sh /var/lib/cassandra/data/<keyspace>/<tablename>-e08e2540e82a11e4a64d8d887149c575/
6,4G /var/lib/cassandra/data/<keyspace>/<tablename>-e08e2540e82a11e4a64d8d887149c575/
However, when I query the size of the snapshots:
$ du -sh 142961465*/<keyspace>/<tablename>
2,9G 1429614655449/<keyspace>/<tablename>
3,1G 1429614656562/<keyspace>/<tablename>
2,9G 1429614656676/<keyspace>/<tablename>
2,7G 1429614656814/<keyspace>/<tablename>
The snapshots have a total size of 11.6 GB; with replication factor 3, the essential part of the data should be ~3.9 GB, yet the /var/lib/cassandra/data/<keyspace>/<tablename>-e08e2540e82a11e4a64d8d887149c575/ folder is significantly larger. Why is this the case? How smart are Cassandra and sstableloader? Are redundant copies filtered out somehow?
You're almost certainly seeing Cassandra doing the right thing: It's importing each sstable, and letting timestamp resolution win.
It's probably the case that your various sstables had various older versions of data: older sstables had obsolete, shadowed cells, and newer sstables had new, live cells. As sstableloader pushes that data into the cluster, the oldest data is written first and then obsoleted by the newer data as it's replayed. If there are deletes, then there will also be tombstones, which actually ADD space usage on top of everything else.
If you need to purge that obsolete data, you can run compaction: either nodetool compact, if that's an option for you - your data set is small enough that it's probably fine - or something like http://www.encql.com/purge-cassandra-tombstones/ to do a single sstable at a time if you're space constrained.
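On newer versions (3.10+, so not the ka-format sstables shown here) there is also nodetool garbagecollect, which rewrites sstables one by one to drop shadowed and deleted data without a full major compaction; hypothetical names:
$ nodetool garbagecollect <keyspace> <tablename>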
We were having a similar issue:
nodetool cleanup
nodetool compact keyspace1 table1 (Note: manual compaction is not recommended per the Cassandra documentation; we did this as part of a migration)
We also found that sstableloader was creating very large files, so we used sstablesplit to break the tables down into smaller files:
https://cassandra.apache.org/doc/latest/cassandra/tools/sstable/sstablesplit.html
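A sketch of the invocation (hypothetical path, and the size is in MB; sstablesplit is an offline tool, so stop Cassandra on the node before running it):
$ sstablesplit --size 50 /var/lib/cassandra/data/keyspace1/table1-<table-id>/*-Data.db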
I have a Cassandra cluster with a keyspace named foo and a table named y.
If I run the following command,
$ nodetool enableautocompaction foo y
do I still have to manually use nodetool compact on foo.y?
Does enableautocompaction enable minor compaction or major compaction? (The documentation for that command was rather sparse.)
Autocompaction starts enabled unless you explicitly disable it, so you shouldn't need this command; it exists more for special-case scenarios and testing.
You also shouldn't run manual compactions with nodetool compact unless you are really sure about what you're doing. Once you run it, the sstable it creates won't be included in normal compactions for a very long time, so you end up having to continually manage the number of sstables by hand or suffer poor read performance.
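If you do run it and want to keep an eye on the fallout, the per-table sstable count shows up in nodetool tablestats (cfstats on older versions), e.g. for the foo.y table from the question:
$ nodetool tablestats foo.y | grep "SSTable count"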
I am doing some simple operations in Cassandra; to keep things simple I am using a single node. I have one single row and I add 10,000 columns to it; next I delete these 10,000 columns; after a while I add 10,000 more columns and then delete them after some time, and so on. The deletes remove all the columns in that one row.
Here's the thing I don't understand: even though I delete them, I see the size of the database increase. My GCGracePeriod is set to 0 and I am using the Leveled Compaction Strategy.
If I understand tombstones correctly, they should be deleted after the first major compaction, but it appears they are not, even after running the nodetool compact command.
I read on some mailing list that these are rolling tombstones (if you frequently update and delete the same row) and are not handled by major compaction. So my question is: when are they deleted? If they aren't, the data would just keep growing, which I personally think is bad. To make matters worse, I could not find any documentation about this particular effect.
First, as you're discovering, this isn't a really good idea. At the very least you should use row-level deletes, not individual column deletes.
Second, there is no such thing as a major compaction with LCS; nodetool compact is a no-op.
Finally, Cassandra 1.2 improves compaction a lot for workloads that generate a lot of tombstones: https://issues.apache.org/jira/browse/CASSANDRA-3442
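That work eventually surfaces as per-table compaction sub-options; a hedged CQL sketch on a hypothetical table (the thresholds are just example values) that makes LCS more eager to run single-sstable tombstone compactions:
ALTER TABLE my_ks.my_table WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'tombstone_threshold': '0.2',
    'unchecked_tombstone_compaction': 'true'
};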