Cassandra: Removing a node

I would like to remove a node from my Cassandra cluster and am following these two related questions (here and here) and the Cassandra documentation, but I am still not quite sure of the exact process.
My first question is: is the following way to remove a node from a Cassandra cluster correct?
1. decommission the node that I would like to remove.
2. removetoken the node that I just decommissioned.
If the above process is right, how can I tell that the decommission process is completed so that I can proceed to the second step? Or is it always safe to do step 2 right after step 1?
In addition, the Cassandra documentation says:
You can take a node out of the cluster with nodetool decommission to a
live node, or nodetool removetoken (to any other machine) to remove a
dead one. This will assign the ranges the old node was responsible for
to other nodes, and replicate the appropriate data there. If
decommission is used, the data will stream from the decommissioned
node. If removetoken is used, the data will stream from the remaining
replicas.
No data is removed automatically from the node being decommissioned,
so if you want to put the node back into service at a different token
on the ring, it should be removed manually.
Does this mean a decommissioned node is a dead node? In addition, since no data is removed automatically from the node being decommissioned, how can I tell when it is safe to remove the data from the decommissioned node (i.e., how do I know when the data streaming is completed)?

Removing a node from a Cassandra cluster takes the following steps (in Cassandra v1.2.8):
1. Decommission the target node with nodetool decommission.
2. Once the data streaming from the decommissioned node is completed, manually delete the data on the decommissioned node (optional).
From the docs:
nodetool decommission - Decommission the *node I am connecting to*
Update: The above process also works for seed nodes. In that case, the cluster is still able to run smoothly without requiring a restart. When you need to restart the cluster for other reasons, be sure to update the seeds parameter specified in cassandra.yaml on all nodes.
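For example, if node-76 in the example below had been a seed, each remaining node's seed list in cassandra.yaml would need editing before the next restart. A minimal sketch, assuming the stock SimpleSeedProvider layout and a tarball-style conf/ path (adjust both to your install):
> grep -A 3 'seed_provider:' conf/cassandra.yaml
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "node-70,node-71"
Here node-76 has been removed from the seeds list; the change takes effect the next time each node restarts.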
Decommission the target node
When the decommission starts, the decommissioned node is first labeled as leaving (its state shows as L, i.e. UL for Up/Leaving). In the following example, we will remove node-76:
> nodetool -host node-76 decommission
> nodetool status
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN node-70 9.79 GB 256 8.3% e0a7fb7a-06f8-4f8b-882d-c60bff51328a 155
UN node-80 8.9 GB 256 9.2% 43dfc22e-b838-4b0b-9b20-66a048f73d5f 155
UN node-72 9.47 GB 256 9.2% 75ebf2a9-e83c-4206-9814-3685e5fa0ab5 155
UN node-71 9.48 GB 256 9.5% cdbfafef-4bfb-4b11-9fb8-27757b0caa47 155
UN node-91 8.05 GB 256 8.4% 6711f8a7-d398-4f93-bd73-47c8325746c3 155
UN node-78 9.11 GB 256 9.4% c82ace5f-9b90-4f5c-9d86-0fbfb7ac2911 155
UL node-76 8.36 GB 256 9.5% 15d74e9e-2791-4056-a341-c02f6614b8ae 155
UN node-73 9.36 GB 256 8.9% c1dfab95-d476-4274-acac-cf6630375566 155
UN node-75 8.93 GB 256 8.2% 8789d89d-2db8-4ddf-bc2d-60ba5edfd0ad 155
UN node-74 8.91 GB 256 9.6% 581fd5bc-20d2-4528-b15d-7475eb2bf5af 155
UN node-79 9.71 GB 256 9.9% 8e192e01-e8eb-4425-9c18-60279b9046ff 155
While the decommissioned node is marked as leaving, it is streaming its data to the other live nodes. Once the streaming is completed, the node no longer appears in the ring, and the share of data owned by the other nodes increases:
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN node-70 9.79 GB 256 9.3% e0a7fb7a-06f8-4f8b-882d-c60bff51328a 155
UN node-80 8.92 GB 256 9.6% 43dfc22e-b838-4b0b-9b20-66a048f73d5f 155
UN node-72 9.47 GB 256 10.2% 75ebf2a9-e83c-4206-9814-3685e5fa0ab5 155
UN node-71 9.69 GB 256 10.6% cdbfafef-4bfb-4b11-9fb8-27757b0caa47 155
UN node-91 8.05 GB 256 9.1% 6711f8a7-d398-4f93-bd73-47c8325746c3 155
UN node-78 9.11 GB 256 10.5% c82ace5f-9b90-4f5c-9d86-0fbfb7ac2911 155
UN node-73 9.36 GB 256 9.7% c1dfab95-d476-4274-acac-cf6630375566 155
UN node-75 9.01 GB 256 9.5% 8789d89d-2db8-4ddf-bc2d-60ba5edfd0ad 155
UN node-74 8.91 GB 256 10.5% 581fd5bc-20d2-4528-b15d-7475eb2bf5af 155
UN node-79 9.71 GB 256 11.0% 8e192e01-e8eb-4425-9c18-60279b9046ff 155
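Another way to tell when the streaming has finished is nodetool netstats against the leaving node: while streaming it reports Mode: LEAVING together with the outgoing streams, and after completion it reports Mode: DECOMMISSIONED with no active streams. The output below is abridged and the exact wording varies between Cassandra versions:
> nodetool -host node-76 netstats
Mode: LEAVING
Streaming to: /node-70
   ...
> nodetool -host node-76 netstats
Mode: DECOMMISSIONED
Not sending any streams.
Not receiving any streams.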
Removing the remaining data manually
Once the streaming is completed, the data stored on the decommissioned node can be removed manually, as described in the Cassandra documentation:
No data is removed automatically from the node being decommissioned,
so if you want to put the node back into service at a different token
on the ring, it should be removed manually.
This can be done by removing the data stored in the data_file_directories, commitlog_directory, and saved_caches_directory specified in the cassandra.yaml file on the decommissioned node.
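A minimal sketch of that manual cleanup, assuming the stock package layout under /var/lib/cassandra and a service-managed install (check your own cassandra.yaml for the actual directories before deleting anything):
> sudo service cassandra stop                       # stop Cassandra on the decommissioned node first
> sudo rm -rf /var/lib/cassandra/data/*             # data_file_directories
> sudo rm -rf /var/lib/cassandra/commitlog/*        # commitlog_directory
> sudo rm -rf /var/lib/cassandra/saved_caches/*     # saved_caches_directory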

Related

Is it possible to speed up Cassandra cleanup process?

I have a Cassandra 3.11.1.0 cluster (6 nodes), and cleanup was not done after 2 nodes were joined.
I started nodetool cleanup on the first node (192.168.20.197), and the cleanup has been in progress for almost 30 days.
$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.20.109 33.47 GiB 256 ? 677dc8b6-eb00-4414-8d15-9f1c79171069 rack1
UN 192.168.20.47 35.41 GiB 256 ? df8c1ee0-fabd-404e-8c55-42531b89d462 rack1
UN 192.168.20.98 20.65 GiB 256 ? 70ce02d7-779b-4b5a-830f-add6ed64bcc2 rack1
UN 192.168.20.21 33.03 GiB 256 ? 40863a80-5f25-464f-aa52-660149bc0070 rack1
UN 192.168.20.197 25.98 GiB 256 ? 5420eae3-e643-49e2-b2d8-703bd5a1f2d4 rack1
UN 192.168.20.151 21.9 GiB 256 ? be7d5df1-3edd-4bc3-8f34-867cb3b8bfca rack1
All nodes which were not cleaned are under load now (CPU load ~80-90%), but the newly joined nodes (192.168.20.98 and 192.168.20.151) have a CPU load of ~10-20%.
It looks like the old nodes are loaded because of old data which could be cleaned up.
Each node has 61 GB RAM and 8 CPU cores. The heap size is 30 GB.
So, my questions are:
1. Is it possible to speed up the cleanup process?
2. Is the CPU load related to the old, unused data (data the nodes no longer own)?
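A hedged sketch of the knobs commonly checked in this situation: cleanup runs through the compaction machinery, so both its progress and its throttle are visible there (setting the throughput to 0 removes the throttle and should only be a temporary measure):
$ nodetool compactionstats             # cleanup tasks and their remaining bytes show up here
$ nodetool getcompactionthroughput     # current compaction/cleanup throttle in MB/s
$ nodetool setcompactionthroughput 0   # 0 = unthrottled; restore the previous value afterwards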

Third Cassandra node has different load

We had a Cassandra cluster with 2 nodes in the same datacenter, with a keyspace replication factor of 2 for the keyspace "newts". If I ran nodetool status I could see that the load was roughly the same between the two nodes and each node owned 100%.
I went ahead and added a third node, and I can see all three nodes in the nodetool status output. I increased the replication factor to three since I now have three nodes and ran "nodetool repair" on the third node. However, when I now run nodetool status I can see that the load between the three nodes differs but each node owns 100%. How can this be, and is there something I'm missing here?
nodetool -u cassandra -pw cassandra status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 84.19.159.94 38.6 GiB 256 100.0% 2d597a3e-0120-410a-a7b8-16ccf9498c55 rack1
UN 84.19.159.93 42.51 GiB 256 100.0% f746d694-c5c2-4f51-aa7f-0b788676e677 rack1
UN 84.19.159.92 5.84 GiB 256 100.0% 8f034b7f-fc2d-4210-927f-991815387078 rack1
nodetool status newts output:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 84.19.159.94 38.85 GiB 256 100.0% 2d597a3e-0120-410a-a7b8-16ccf9498c55 rack1
UN 84.19.159.93 42.75 GiB 256 100.0% f746d694-c5c2-4f51-aa7f-0b788676e677 rack1
UN 84.19.159.92 6.17 GiB 256 100.0% 8f034b7f-fc2d-4210-927f-991815387078 rack1
As you added a node (so there are now three nodes) and increased your replication factor to three, each node holds a copy of your data and therefore owns 100% of it.
The differing "Load" values can result from not running nodetool cleanup on the two old nodes after adding your third node: old data in their SSTables is not removed when the new node is added, only later after a cleanup and/or compaction:
Load - updates every 90 seconds. The amount of file system data under the cassandra data directory after excluding all content in the snapshots subdirectories. Because all SSTable data files are included, any data that is not cleaned up (such as TTL-expired cells or tombstoned data) is counted.
(from https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsStatus.html)
Just run nodetool repair on all 3 nodes, then run nodetool cleanup one by one on the existing nodes and restart the nodes one after another; that seems to work.
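A minimal sketch of that sequence (the keyspace name newts and the credentials are taken from the question; run each command on one node at a time and let it finish before moving on):
nodetool -u cassandra -pw cassandra repair newts     # on each of the three nodes
nodetool -u cassandra -pw cassandra cleanup newts    # on the two original nodes only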

Cassandra data files MUCH larger than expected

I just did an experiment in which I loaded around a dozen CSV files, weighing in at around 5.2 GB (compressed). After they were uploaded to Cassandra, they take up 64 GB! (Actually around 128 GB, but that is due to the replication factor being 2.)
Frankly, I expected Cassandra's data to take up even less than the original 5.2 GB of CSV because:
1. Cassandra should be able to store data (mostly numbers) in binary format instead of ASCII
2. Cassandra should have split a single file into its column constituents and improved compression dramatically
I'm completely new to Cassandra and this was an experiment. It is entirely possible that I misunderstand the product or mis-configured it.
Is it expected that 5.2 GB of CSVs will end up as 64 GB of Cassandra files?
EDIT: Additional info:
[cqlsh 5.0.1 | Cassandra 2.1.11 | CQL spec 3.2.1 | Native protocol v3]
[~]$ nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN xx.x.xx.xx1 13.17 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx2 14.02 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx3 13.09 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx4 12.32 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx5 12.84 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx6 12.66 GB 256 ? HOSTID RAC1
du -h [directory which contains sstables before they are loaded]: 67GB
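A hedged diagnostic sketch for narrowing down where the space goes: nodetool cfstats reports per-table disk usage and the SSTable compression ratio, which usually explains most of the difference between raw CSV size and on-disk size (output abridged, placeholders in angle brackets):
[~]$ nodetool cfstats
    ...
    Table: <one of the loaded tables>
        Space used (live): ...
        SSTable Compression Ratio: ...
    ...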

Cassandra nodetool status inconsistent on different nodes with too many pending compaction tasks

I have a cassandra 2.0.6 cluster with four nodes. The cluster suffers from an inconsistency problem. I use nodetool status to check the status on each node, and the results are inconsistent. Besides, this status command runs very slowly. The following are the command results on each node.
Nodes with IPs 192.168.148.181 and 192.168.148.121 are the seed nodes. The cluster has never run repair before.
Besides, the CPU usage on 181 and 121 is really high, and the log shows that CMS GC is very frequent on these nodes. I disconnected all the clients so there is no read or write load, but the inconsistency and high GC persist.
So how to debug and optimize this cluster?
[cassandra#whaty121 apache-cassandra-2.0.16]$ time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.148.121 10.86 GB 1 25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2
UN 192.168.148.181 10.53 GB 1 25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1
DN 192.168.148.182 10.95 GB 1 25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4
UN 192.168.148.221 10.49 GB 1 25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3
real 8m50.506s
user 39m48.718s
sys 76m48.566s
--------------------------------------------------------------------------------
[cassandra#whaty221 apache-cassandra-2.0.16]$ time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.148.121 10.86 GB 1 25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2
UN 192.168.148.181 10.53 GB 1 25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1
DN 192.168.148.182 10.95 GB 1 25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4
UN 192.168.148.221 10.49 GB 1 25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3
real 0m15.075s
user 0m1.606s
sys 0m0.393s
----------------------------------------------------------------------
[cassandra#whaty181 apache-cassandra-2.0.16]$ time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.148.121 10.86 GB 1 25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2
UN 192.168.148.181 10.53 GB 1 25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1
UN 192.168.148.182 10.95 GB 1 25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4
UN 192.168.148.221 10.49 GB 1 25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3
real 0m25.719s
user 0m2.152s
sys 0m1.228s
-------------------------------------------------------------------------
[cassandra#whaty182 apache-cassandra-2.0.16]$ time bin/nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.148.121 10.86 GB 1 25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2
DN 192.168.148.181 10.53 GB 1 25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1
UN 192.168.148.182 10.95 GB 1 25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4
DN 192.168.148.221 10.49 GB 1 25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3
real 0m17.581s
user 0m1.843s
sys 0m1.632s
I printed the object details from the GC heap histogram:
num #instances #bytes class name
----------------------------------------------
1: 58584535 1874705120 java.util.concurrent.FutureTask
2: 58585802 1406059248 java.util.concurrent.Executors$RunnableAdapter
3: 58584601 1406030424 java.util.concurrent.LinkedBlockingQueue$Node
4: 58584534 1406028816 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask
5: 214682 24087416 [B
6: 217294 10430112 java.nio.HeapByteBuffer
7: 37591 5977528 [C
8: 41843 5676048 <constMethodKlass>
9: 41843 5366192 <methodKlass>
10: 4126 4606080 <constantPoolKlass>
11: 100060 4002400 org.apache.cassandra.io.sstable.IndexHelper$IndexInfo
12: 4126 2832176 <instanceKlassKlass>
13: 4880 2686216 [J
14: 3619 2678784 <constantPoolCacheKlass>
I used nodetool compactionstats on one node and found that many compaction tasks have accumulated over 3 days (I restarted the cluster 3 days ago):
[cassandra#whaty181 apache-cassandra-2.0.16]$ bin/nodetool compactionstats
pending tasks: 64642341
Active compaction remaining time : n/a
I checked nodetool compactionhistory. Here is part of the results; it shows many records related to the system keyspace.
Compaction History:
id keyspace_name columnfamily_name compacted_at bytes_in bytes_out rows_merged
8e4f8830-b04f-11e5-a211-45b7aa88107c system sstable_activity 1451629144115 3342 915 {4:23}
96a6fcb0-b04b-11e5-a211-45b7aa88107c system hints 1451627440123 18970740 18970740 {1:1}
7c42c940-adac-11e5-8bd4-45b7aa88107c system hints 1451339203540 56969835 56782732 {2:3}
585b97a0-ad98-11e5-8bd4-45b7aa88107c system sstable_activity 1451330553370 3700 956 {4:24}
aefc3f10-b1b2-11e5-a211-45b7aa88107c system sstable_activity 1451781670273 3201 906 {4:23}
1e76f1b0-b180-11e5-a211-45b7aa88107c system sstable_activity 1451759952971 3303 700 {4:23}
e7b75b70-aec2-11e5-8bd4-45b7aa88107c system hints 1451458783911 57690316 57497847 {2:3}
ad102280-af6d-11e5-b1dc-45b7aa88107c webtrn_study_log_formallySCORM_STU_COURSE 1451532129448 45671877 41137664 {1:11, 3:1, 4:8}
60906970-aec7-11e5-8bd4-45b7aa88107c system sstable_activity 1451460704647 3751 974 {4:25}
88aed310-ad91-11e5-8bd4-45b7aa88107c system hints 1451327627969 56984347 56765328 {2:3}
3ad14f00-af6d-11e5-b1dc-45b7aa88107c webtrn_study_log_formallySCORM_STU_COURSE 1451531937776 46696097 38827028 {1:8, 3:2, 4:9}
84df8fb0-b00f-11e5-a211-45b7aa88107c system hints 1451601640491 18970740 18970740 {1:1}
657482e0-ad33-11e5-8bd4-45b7aa88107c system sstable_activity 1451287196174 3701 931 {4:24}
9cc8af70-b24a-11e5-a211-45b7aa88107c system sstable_activity 1451846923239 3134 773 {4:23}
dcbe5e30-afd0-11e5-a211-45b7aa88107c system sstable_activity 1451574729619 3357 790 {4:23}
b285ced0-afa0-11e5-84e3-45b7aa88107c system hints 1451554042941 43310718 42137761 {1:1, 2:2}
119770e0-ad4e-11e5-8bd4-45b7aa88107c system hints 1451298651886 57397441 57190519 {2:3}
f1bb37a0-b204-11e5-a211-45b7aa88107c system hints 1451817000986 17713746
I tried to flush the node with high GC, but it failed with a read timeout.
The cluster only receives data to insert. I shut down the client writes and restarted the cluster during these 3 days, but the compaction tasks still accumulate.
The inconsistency in the nodetool status output is nothing to worry about. It is the result of having a lot of GCs: during a GC pause a node is considered down by the other nodes' gossipers, so when you have a lot of GCs, nodes switch between DN and UN very quickly.
You must understand what is taking so much space in the Java heap:
Do you have any StatusLogger entries in the Cassandra logs?
Using nodetool cfstats, do you see anything in system.hints? Hints are mutations that have been delayed by the coordinator to be delivered later, when the load is lower. If your cluster has accumulated a lot of hints, it will pressure the heap and cause GCs.
Is there any compaction going on? (nodetool compactionstats)
Does flushing all column families cool down your cluster? (nodetool flush on all nodes for each keyspace and column family)
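A compact sketch of those checks as commands (the log path is an assumption based on a typical package install):
$ grep StatusLogger /var/log/cassandra/system.log   # GC-pressure summaries logged by Cassandra
$ nodetool cfstats                                  # look for the size of the system.hints column family
$ nodetool compactionstats                          # pending and active compactions
$ nodetool flush                                    # with no arguments, flushes every keyspace on this node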

Completely unbalanced DC after bootstrapping new node

I've just added a new node to my Cassandra DC. Previously, my topology was as follows:
DC Cassandra: 1 node
DC Solr: 5 nodes
When I bootstrapped a 2nd node for the Cassandra DC, I noticed that the total bytes to be streamed were almost as big as the load of the existing node (916 GB to stream; the load of the existing Cassandra node is 956 GB). Nevertheless, I allowed the bootstrap to proceed. It completed a few hours ago and now my fear is confirmed: the Cassandra DC is completely unbalanced.
Nodetool status shows the following:
Datacenter: Solr
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN solr node4 322.9 GB 40.3% 30f411c3-7419-4786-97ad-395dfc379b40 -8998044611302986942 rack1
UN solr node3 233.16 GB 39.7% c7db42c6-c5ae-439e-ab8d-c04b200fffc5 -9145710677669796544 rack1
UN solr node5 252.42 GB 41.6% 2d3dfa16-a294-48cc-ae3e-d4b99fbc947c -9004172260145053237 rack1
UN solr node2 245.97 GB 40.5% 7dbbcc88-aabc-4cf4-a942-08e1aa325300 -9176431489687825236 rack1
UN solr node1 402.33 GB 38.0% 12976524-b834-473e-9bcc-5f9be74a5d2d -9197342581446818188 rack1
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN cs node2 705.58 GB 99.4% fa55e0bb-e460-4dc1-ac7a-f71dd00f5380 -9114885310887105386 rack1
UN cs node1 1013.52 GB 0.6% 6ab7062e-47fe-45f7-98e8-3ee8e1f742a4 -3083852333946106000 rack1
Notice the 'Owns' column in the Cassandra DC: node2 owns 99.4% while node1 owns 0.6% (despite node2 having a smaller 'Load' than node1). I expected them to own 50% each, but this is what I got, and I don't know what caused it. What I can remember is that I was running a full repair on Solr node1 when I started the bootstrap of the new node. The repair is still running as of this moment (I think it actually restarted when the new node finished bootstrapping).
How do I fix this? (repair?)
Is it safe to bulk-load new data while the Cassandra DC is in this state?
Some additional info:
DSE 4.0.3 (Cassandra 2.0.7)
NetworkTopologyStrategy
RF1 in Cassandra DC; RF2 in Solr DC
DC auto-assigned by DSE
Vnodes enabled
Config of new node is modeled after the config of the existing node; so more or less it is correct
EDIT:
It turns out that I can't run cleanup on cs-node1 either. I'm getting the following exception:
Exception in thread "main" java.lang.AssertionError: [SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-18509-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-18512-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38320-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38325-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38329-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38322-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38330-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38331-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38321-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38323-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38344-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38345-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38349-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38348-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38346-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-13913-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-13915-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38389-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-39845-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38390-Data.db')]
at org.apache.cassandra.db.ColumnFamilyStore$13.call(ColumnFamilyStore.java:2115)
at org.apache.cassandra.db.ColumnFamilyStore$13.call(ColumnFamilyStore.java:2112)
at org.apache.cassandra.db.ColumnFamilyStore.runWithCompactionsDisabled(ColumnFamilyStore.java:2094)
at org.apache.cassandra.db.ColumnFamilyStore.markAllCompacting(ColumnFamilyStore.java:2125)
at org.apache.cassandra.db.compaction.CompactionManager.performAllSSTableOperation(CompactionManager.java:214)
at org.apache.cassandra.db.compaction.CompactionManager.performCleanup(CompactionManager.java:265)
at org.apache.cassandra.db.ColumnFamilyStore.forceCleanup(ColumnFamilyStore.java:1105)
at org.apache.cassandra.service.StorageService.forceKeyspaceCleanup(StorageService.java:2220)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
at sun.rmi.transport.Transport$1.run(Transport.java:177)
at sun.rmi.transport.Transport$1.run(Transport.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
EDIT:
Nodetool status output (without keyspace)
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: Solr
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN solr node4 323.78 GB 17.1% 30f411c3-7419-4786-97ad-395dfc379b40 -8998044611302986942 rack1
UN solr node3 236.69 GB 17.3% c7db42c6-c5ae-439e-ab8d-c04b200fffc5 -9145710677669796544 rack1
UN solr node5 256.06 GB 16.2% 2d3dfa16-a294-48cc-ae3e-d4b99fbc947c -9004172260145053237 rack1
UN solr node2 246.59 GB 18.3% 7dbbcc88-aabc-4cf4-a942-08e1aa325300 -9176431489687825236 rack1
UN solr node1 411.25 GB 13.9% 12976524-b834-473e-9bcc-5f9be74a5d2d -9197342581446818188 rack1
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN cs node2 709.64 GB 17.2% fa55e0bb-e460-4dc1-ac7a-f71dd00f5380 -9114885310887105386 rack1
UN cs node1 1003.71 GB 0.1% 6ab7062e-47fe-45f7-98e8-3ee8e1f742a4 -3083852333946106000 rack1
Cassandra yaml from node1: https://www.dropbox.com/s/ptgzp5lfmdaeq8d/cassandra.yaml (only difference with node2 is listen_address and commitlog_directory)
Regarding CASSANDRA-6774, it's a bit different because I didn't stop a previous cleanup. Although I think I took the wrong route by starting a scrub (still in progress) instead of restarting the node first, as their suggested workaround says.
UPDATE (2014/04/19):
nodetool cleanup still fails with an assertion error after doing the following:
Full scrub of the keyspace
Full cluster restart
I'm now doing a full repair of the keyspace in cs-node1
UPDATE (2014/04/20):
Any attempt to repair the main keyspace in cs-node1 fails with:
Lost notification. You should check server log for repair status of keyspace
I also saw this just now (output of dsetool ring)
Note: Ownership information does not include topology, please specify a keyspace.
Address DC Rack Workload Status State Load Owns VNodes
solr-node1 Solr rack1 Search Up Normal 447 GB 13.86% 256
solr-node2 Solr rack1 Search Up Normal 267.52 GB 18.30% 256
solr-node3 Solr rack1 Search Up Normal 262.16 GB 17.29% 256
cs-node2 Cassandra rack1 Cassandra Up Normal 808.61 GB 17.21% 256
solr-node5 Solr rack1 Search Up Normal 296.14 GB 16.21% 256
solr-node4 Solr rack1 Search Up Normal 340.53 GB 17.07% 256
cd-node1 Cassandra rack1 Cassandra Up Normal 896.68 GB 0.06% 256
Warning: Node cs-node2 is serving 270.56 times the token space of node cs-node1, which means it will be using 270.56 times more disk space and network bandwidth. If this is unintentional, check out http://wiki.apache.org/cassandra/Operations#Ring_management
Warning: Node solr-node2 is serving 1.32 times the token space of node solr-node1, which means it will be using 1.32 times more disk space and network bandwidth. If this is unintentional, check out http://wiki.apache.org/cassandra/Operations#Ring_management
Keyspace-aware:
Address DC Rack Workload Status State Load Effective-Ownership VNodes
solr-node1 Solr rack1 Search Up Normal 447 GB 38.00% 256
solr-node2 Solr rack1 Search Up Normal 267.52 GB 40.47% 256
solr-node3 Solr rack1 Search Up Normal 262.16 GB 39.66% 256
cs-node2 Cassandra rack1 Cassandra Up Normal 808.61 GB 99.39% 256
solr-node5 Solr rack1 Search Up Normal 296.14 GB 41.59% 256
solr-node4 Solr rack1 Search Up Normal 340.53 GB 40.28% 256
cs-node1 Cassandra rack1 Cassandra Up Normal 896.68 GB 0.61% 256
Warning: Node cd-node2 is serving 162.99 times the token space of node cs-node1, which means it will be using 162.99 times more disk space and network bandwidth. If this is unintentional, check out http://wiki.apache.org/cassandra/Operations#Ring_management
This is a strong indicator that something is wrong with the way cs-node2 bootstrapped (as I described at the beginning of my post).
It looks like your issue is that you most likely switched from single tokens to vnodes on your existing nodes, so all of their tokens are in a contiguous range. This is actually not possible to do in current Cassandra versions because it was too hard to get right.
The only real way to fix it and be able to add a new node is to decommission the new node you just added, then follow the current documentation on switching from single tokens to vnodes, which basically says you need to build a brand-new data center with brand-new vnode-enabled nodes in it and then decommission the existing nodes.
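A minimal sketch of that first step, using the node names from the question (run against the node that was just bootstrapped, and verify it has left the ring before rebuilding the DC):
$ nodetool -h cs-node2 decommission   # stream its ranges back and leave the ring
$ nodetool status                     # confirm cs-node2 is gone before proceeding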
