We had a Cassandra cluster with 2 nodes in the same datacenter and a replication factor of 2 for the keyspace "newts". When I ran nodetool status, the load was roughly the same on both nodes and each node owned 100%.
I went ahead and added a third node, and I can see all three nodes in the nodetool status output. Since I now have three nodes, I increased the replication factor to three and ran "nodetool repair" on the third node. However, when I now run nodetool status, the load differs between the three nodes, yet each node still owns 100%. How can this be, and is there something I'm missing here?
nodetool -u cassandra -pw cassandra status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 84.19.159.94 38.6 GiB 256 100.0% 2d597a3e-0120-410a-a7b8-16ccf9498c55 rack1
UN 84.19.159.93 42.51 GiB 256 100.0% f746d694-c5c2-4f51-aa7f-0b788676e677 rack1
UN 84.19.159.92 5.84 GiB 256 100.0% 8f034b7f-fc2d-4210-927f-991815387078 rack1
nodetool status newts output:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 84.19.159.94 38.85 GiB 256 100.0% 2d597a3e-0120-410a-a7b8-16ccf9498c55 rack1
UN 84.19.159.93 42.75 GiB 256 100.0% f746d694-c5c2-4f51-aa7f-0b788676e677 rack1
UN 84.19.159.92 6.17 GiB 256 100.0% 8f034b7f-fc2d-4210-927f-991815387078 rack1
Since you added a node, now have three nodes, and increased your replication factor to three, each node holds a copy of all your data and therefore owns 100% of it.
The differing "Load" values can result from not running nodetool cleanup on the two old nodes after adding the third node - data those nodes no longer own is not removed from their SSTables when the new node joins, only later after a cleanup and/or compaction:
Load - updates every 90 seconds. The amount of file system data under the cassandra data directory after excluding all content in the snapshots subdirectories. Because all SSTable data files are included, any data that is not cleaned up (such as TTL-expired cells or tombstoned data) is counted.
(from https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsStatus.html)
Just run nodetool repair on all 3 nodes, then run nodetool cleanup one by one on the two pre-existing nodes and restart each node one after another; that should sort it out.
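A minimal sketch of that sequence, using the hosts, JMX credentials and keyspace from the question (adjust to your environment; cleanup only needs to run on the two pre-existing nodes):
# repair the keyspace on every node
nodetool -u cassandra -pw cassandra -h 84.19.159.92 repair newts
nodetool -u cassandra -pw cassandra -h 84.19.159.93 repair newts
nodetool -u cassandra -pw cassandra -h 84.19.159.94 repair newts
# then drop data the old nodes no longer own, one node at a time
nodetool -u cassandra -pw cassandra -h 84.19.159.93 cleanup newts
nodetool -u cassandra -pw cassandra -h 84.19.159.94 cleanup newts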
I have a Cassandra 3.11.1.0 cluster (6 nodes), and cleanup was not done after 2 new nodes joined.
I started nodetool cleanup on the first node (192.168.20.197) and the cleanup has been in progress for almost 30 days.
$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.20.109 33.47 GiB 256 ? 677dc8b6-eb00-4414-8d15-9f1c79171069 rack1
UN 192.168.20.47 35.41 GiB 256 ? df8c1ee0-fabd-404e-8c55-42531b89d462 rack1
UN 192.168.20.98 20.65 GiB 256 ? 70ce02d7-779b-4b5a-830f-add6ed64bcc2 rack1
UN 192.168.20.21 33.03 GiB 256 ? 40863a80-5f25-464f-aa52-660149bc0070 rack1
UN 192.168.20.197 25.98 GiB 256 ? 5420eae3-e643-49e2-b2d8-703bd5a1f2d4 rack1
UN 192.168.20.151 21.9 GiB 256 ? be7d5df1-3edd-4bc3-8f34-867cb3b8bfca rack1
All nodes which were not cleaned up are under load now (CPU load ~80-90%), but the newly joined nodes (192.168.20.98 and 192.168.20.151) have a CPU load of ~10-20%.
It looks like the old nodes are loaded because of old data that could be cleaned up.
Each node has 61 GB RAM and 8 CPU cores. The heap size is 30 GB.
So, my questions are:
Is it possible to speed up the cleanup process?
Is the CPU load related to the old, unused data (which the nodes no longer own) still sitting on those nodes?
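As a rough sketch of things that may help, these standard nodetool commands watch and (un)throttle the compaction work a long cleanup goes through; the value 0 (unthrottled) and the keyspace name my_keyspace are placeholders, and the -j flag exists only on newer 3.x releases (check nodetool help cleanup):
# see what cleanup/compaction tasks are currently running and queued
nodetool compactionstats
# check and raise the compaction throughput cap (MB/s); 0 disables throttling
nodetool getcompactionthroughput
nodetool setcompactionthroughput 0
# on newer 3.x releases, cleanup can process several SSTables in parallel
nodetool cleanup -j 2 my_keyspace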
I configured two DCs with replication in two regions (NCSA and EMEA) using JanusGraph (Gremlin/Cassandra/Elasticsearch). The replication works well, but the performance is not that great.
I get times of around 250 ms just for a read on an NCSA node (vs 30 ms when I have only 1 DC / 1 node), and a write takes around 800 ms.
I tried to modify some configuration:
storage.cassandra.replication-factor
storage.cassandra.read-consistency-level
storage.cassandra.write-consistency-level
Is there any other settings/configurations that I could modify in order to get better performance for a multi-region setup or that kind of performance is expected with Janusgraph/Cassandra?
Thanks
The lowest times I was able to get were with:
storage.replication-strategy-class=org.apache.cassandra.locator.NetworkTopologyStrategy
storage.cassandra.replication-factor=6
storage.cassandra.read-consistency-level=ONE
storage.cassandra.write-consistency-level=ONE
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.130.xxx.xxx 184.02 KB 256 100.0% 7c4c23f4-0112-4023-8af1-81a179f68973 RAC2
UN 10.130.xxx.xxx 540.67 KB 256 100.0% 193f0814-649f-4450-8b2e-85344f2c3cf2 RAC3
UN 10.130.xxx.xxx 187.47 KB 256 100.0% fbbc42d6-a061-4604-935e-dbe1155d4017 RAC1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.30.xxx.xxx 93.3 KB 256 100.0% e7221808-ccb4-414a-b5b6-6e578ecb6f25 RAC3
UN 10.30.xxx.xxx 287.62 KB 256 100.0% ca868262-4b5d-44d6-80f9-25439f8d2611 RAC2
UN 10.30.xxx.xxx 282.27 KB 256 100.0% 82d0f75d-635c-4016-84ca-ef9d1afda066 RAC1
JanusGraph comes with different cache levels; activating some of them may help.
Regarding the consistency level, in a multi-DC configuration the LOCAL_xxx values will give better performance, but for safety I would also set the name of the local (or closest) Cassandra datacenter (configuration parameter: storage.cassandra.astyanax.local-datacenter).
Are you able to say where the time is spent (in the Cassandra layer or in the JanusGraph layer)? To find out the response time of Cassandra itself, you can run nodetool proxyhistograms, which shows the full request latency recorded by the coordinators.
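As a rough sketch of what that could look like in the JanusGraph properties file (the datacenter name DC1 is taken from the ring output above; LOCAL_QUORUM and the cache values are illustrative assumptions to tune, not recommendations):
# read and write against the local datacenter only
storage.cassandra.read-consistency-level=LOCAL_QUORUM
storage.cassandra.write-consistency-level=LOCAL_QUORUM
storage.cassandra.astyanax.local-datacenter=DC1
# enable the JanusGraph database-level cache
cache.db-cache=true
cache.db-cache-time=180000
cache.db-cache-size=0.25
Note that the db-cache can serve slightly stale reads when several JanusGraph instances write to the same graph, so it is a latency/consistency trade-off.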
I have added a new node into the cluster and was expecting the data on Cassandra to balance itself across nodes.
nodetool status yields:
$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.128.0.7 270.75 GiB 256 48.6% 1a3f6faa-4376-45a8-9c20-11480ae5664c rack1
UN 10.128.0.14 414.36 KiB 256 51.4% 66a89fbf-08ba-4b5d-9f10-55d52a199b41 rack1
The load of node 2 is just ~400 KB. We have time-series data and query it. How can I rebalance the load between these nodes?
The configuration for both nodes is:
cluster_name: 'cluster1'
- seeds: "node1_ip, node2_ip"
num_tokens: 256
endpoint_snitch: GossipingPropertyFileSnitch
auto_bootstrap: false
thank you for your time :)
I have added a new node into the cluster and was expecting the data on Cassandra to balance itself across nodes.
Explicitly setting `auto_bootstrap: false` tells it not to do that.
how can I rebalance the load?
Set your keyspace to a RF of 2.
Run nodetool -h 10.128.0.14 repair.
-Or-
Take node 10.128.0.14 out of the cluster.
Set auto_bootstrap: true (or just remove it).
And start the node up. It should join and stream data.
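A minimal sketch of both options, assuming the keyspace is called my_keyspace (a placeholder) and the datacenter name dc1 from the status output; check the names against your own schema first:
# Option 1: raise the replication factor, then repair the new node
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 2};"
nodetool -h 10.128.0.14 repair my_keyspace
# Option 2: re-bootstrap the new node so it streams its own ranges
nodetool -h 10.128.0.14 decommission
# then on 10.128.0.14: set auto_bootstrap: true (or remove the line) in cassandra.yaml,
# clear its data directories, and start Cassandra again so it rejoins and streams data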
Pro-tip: With a data footprint of 270GB, you should have been running with more than one node to begin with. It would have been much easier to start with 3 nodes (which is probably the minimum you should be running on).
After setting up a 3-node Cassandra cluster (version 2.1.9), I ran the "nodetool status" command and realized that the effective ownership percentages sum up to 200%.
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <IP> 105.35 KB 256 67.4% <HostID> rack1
UN <IP> 121.92 KB 256 63.3% <HostID> rack1
UN <IP3> 256.11 KB 256 69.3% <HostID> rack1
Does anyone know why we would get 200% ownership? Is it because of some replication factor? If so, how do I find out about that?
Thanks!
This is dependent on the replication factor of the keyspace you are displaying.
For example, if you create a keyspace like this:
CREATE KEYSPACE test_keyspace WITH replication = {'class':
'NetworkTopologyStrategy', 'datacenter1': 2 };
And then display the status of that keyspace:
nodetool status test_keyspace
Then the Owns column will sum to 200%.
If you used a replication factor of 3, it would sum to 300%, and if you used a replication factor of 1, it would sum to 100%.
To see how a keyspace is defined, go into cqlsh and enter desc keyspace test_keyspace
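For example, a quick sketch (test_keyspace as above; on Cassandra 2.1 the replication settings can also be read from the system.schema_keyspaces table):
cqlsh -e "DESC KEYSPACE test_keyspace;"
# or query the schema table directly (Cassandra 2.x)
cqlsh -e "SELECT keyspace_name, strategy_class, strategy_options FROM system.schema_keyspaces;"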
I've just added a new node to my Cassandra DC. Previously, my topology was as follows:
DC Cassandra: 1 node
DC Solr: 5 nodes
When I bootstrapped a 2nd node for the Cassandra DC, I noticed that the total bytes to be streamed was almost as big as the load of the existing node (916 GB to stream; the load of the existing Cassandra node is 956 GB). Nevertheless, I allowed the bootstrap to proceed. It completed a few hours ago and now my fear is confirmed: the Cassandra DC is completely unbalanced.
Nodetool status shows the following:
Datacenter: Solr
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN solr node4 322.9 GB 40.3% 30f411c3-7419-4786-97ad-395dfc379b40 -8998044611302986942 rack1
UN solr node3 233.16 GB 39.7% c7db42c6-c5ae-439e-ab8d-c04b200fffc5 -9145710677669796544 rack1
UN solr node5 252.42 GB 41.6% 2d3dfa16-a294-48cc-ae3e-d4b99fbc947c -9004172260145053237 rack1
UN solr node2 245.97 GB 40.5% 7dbbcc88-aabc-4cf4-a942-08e1aa325300 -9176431489687825236 rack1
UN solr node1 402.33 GB 38.0% 12976524-b834-473e-9bcc-5f9be74a5d2d -9197342581446818188 rack1
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN cs node2 705.58 GB 99.4% fa55e0bb-e460-4dc1-ac7a-f71dd00f5380 -9114885310887105386 rack1
UN cs node1 1013.52 GB 0.6% 6ab7062e-47fe-45f7-98e8-3ee8e1f742a4 -3083852333946106000 rack1
Notice the 'Owns' column in the Cassandra DC: node2 owns 99.4% while node1 owns 0.6% (despite node2 having a smaller 'Load' than node1). I expected them to own 50% each, but this is what I got, and I don't know what caused it. What I can remember is that I was running a full repair on Solr node1 when I started the bootstrap of the new node. The repair is still running as of this moment (I think it actually restarted when the new node finished bootstrapping).
How do I fix this? (repair?)
Is it safe to bulk-load new data while the Cassandra DC is in this state?
Some additional info:
DSE 4.0.3 (Cassandra 2.0.7)
NetworkTopologyStrategy
RF1 in Cassandra DC; RF2 in Solr DC
DC auto-assigned by DSE
Vnodes enabled
Config of new node is modeled after the config of the existing node; so more or less it is correct
EDIT:
It turns out that I can't run cleanup on cs-node1 either. I'm getting the following exception:
Exception in thread "main" java.lang.AssertionError: [SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-18509-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-18512-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38320-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38325-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38329-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38322-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38330-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38331-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38321-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38323-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38344-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38345-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38349-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38348-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38346-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-13913-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-13915-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38389-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-39845-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38390-Data.db')]
at org.apache.cassandra.db.ColumnFamilyStore$13.call(ColumnFamilyStore.java:2115)
at org.apache.cassandra.db.ColumnFamilyStore$13.call(ColumnFamilyStore.java:2112)
at org.apache.cassandra.db.ColumnFamilyStore.runWithCompactionsDisabled(ColumnFamilyStore.java:2094)
at org.apache.cassandra.db.ColumnFamilyStore.markAllCompacting(ColumnFamilyStore.java:2125)
at org.apache.cassandra.db.compaction.CompactionManager.performAllSSTableOperation(CompactionManager.java:214)
at org.apache.cassandra.db.compaction.CompactionManager.performCleanup(CompactionManager.java:265)
at org.apache.cassandra.db.ColumnFamilyStore.forceCleanup(ColumnFamilyStore.java:1105)
at org.apache.cassandra.service.StorageService.forceKeyspaceCleanup(StorageService.java:2220)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
at sun.rmi.transport.Transport$1.run(Transport.java:177)
at sun.rmi.transport.Transport$1.run(Transport.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
EDIT:
Nodetool status output (without keyspace)
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: Solr
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN solr node4 323.78 GB 17.1% 30f411c3-7419-4786-97ad-395dfc379b40 -8998044611302986942 rack1
UN solr node3 236.69 GB 17.3% c7db42c6-c5ae-439e-ab8d-c04b200fffc5 -9145710677669796544 rack1
UN solr node5 256.06 GB 16.2% 2d3dfa16-a294-48cc-ae3e-d4b99fbc947c -9004172260145053237 rack1
UN solr node2 246.59 GB 18.3% 7dbbcc88-aabc-4cf4-a942-08e1aa325300 -9176431489687825236 rack1
UN solr node1 411.25 GB 13.9% 12976524-b834-473e-9bcc-5f9be74a5d2d -9197342581446818188 rack1
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN cs node2 709.64 GB 17.2% fa55e0bb-e460-4dc1-ac7a-f71dd00f5380 -9114885310887105386 rack1
UN cs node1 1003.71 GB 0.1% 6ab7062e-47fe-45f7-98e8-3ee8e1f742a4 -3083852333946106000 rack1
Cassandra yaml from node1: https://www.dropbox.com/s/ptgzp5lfmdaeq8d/cassandra.yaml (only difference with node2 is listen_address and commitlog_directory)
Regarding CASSANDRA-6774, my case is a bit different because I didn't stop a previous cleanup. Although I think I took the wrong route by starting a scrub (still in progress) instead of restarting the node first, as their suggested workaround recommends.
UPDATE (2014/04/19):
nodetool cleanup still fails with an assertion error after doing the following:
Full scrub of the keyspace
Full cluster restart
I'm now doing a full repair of the keyspace on cs-node1.
UPDATE (2014/04/20):
Any attempt to repair the main keyspace in cs-node1 fails with:
Lost notification. You should check server log for repair status of keyspace
I also saw this just now (output of dsetool ring)
Note: Ownership information does not include topology, please specify a keyspace.
Address DC Rack Workload Status State Load Owns VNodes
solr-node1 Solr rack1 Search Up Normal 447 GB 13.86% 256
solr-node2 Solr rack1 Search Up Normal 267.52 GB 18.30% 256
solr-node3 Solr rack1 Search Up Normal 262.16 GB 17.29% 256
cs-node2 Cassandra rack1 Cassandra Up Normal 808.61 GB 17.21% 256
solr-node5 Solr rack1 Search Up Normal 296.14 GB 16.21% 256
solr-node4 Solr rack1 Search Up Normal 340.53 GB 17.07% 256
cd-node1 Cassandra rack1 Cassandra Up Normal 896.68 GB 0.06% 256
Warning: Node cs-node2 is serving 270.56 times the token space of node cs-node1, which means it will be using 270.56 times more disk space and network bandwidth. If this is unintentional, check out http://wiki.apache.org/cassandra/Operations#Ring_management
Warning: Node solr-node2 is serving 1.32 times the token space of node solr-node1, which means it will be using 1.32 times more disk space and network bandwidth. If this is unintentional, check out http://wiki.apache.org/cassandra/Operations#Ring_management
Keyspace-aware:
Address DC Rack Workload Status State Load Effective-Ownership VNodes
solr-node1 Solr rack1 Search Up Normal 447 GB 38.00% 256
solr-node2 Solr rack1 Search Up Normal 267.52 GB 40.47% 256
solr-node3 Solr rack1 Search Up Normal 262.16 GB 39.66% 256
cs-node2 Cassandra rack1 Cassandra Up Normal 808.61 GB 99.39% 256
solr-node5 Solr rack1 Search Up Normal 296.14 GB 41.59% 256
solr-node4 Solr rack1 Search Up Normal 340.53 GB 40.28% 256
cs-node1 Cassandra rack1 Cassandra Up Normal 896.68 GB 0.61% 256
Warning: Node cd-node2 is serving 162.99 times the token space of node cs-node1, which means it will be using 162.99 times more disk space and network bandwidth. If this is unintentional, check out http://wiki.apache.org/cassandra/Operations#Ring_management
This is a strong indicator that something is wrong with the way cs-node2 bootstrapped (as I described at the beginning of my post).
It looks like your issue is that you most likely switched from single tokens to vnodes on your existing node, so all of its tokens sit in one contiguous row. This kind of in-place switch is actually not possible in current Cassandra versions because it was too hard to get right.
The only real way to fix it and still be able to add a new node is to decommission the first new node you added, then follow the current documentation on switching from single tokens to vnodes, which basically says to build a brand-new datacenter of vnode-enabled nodes and then decommission the existing nodes.
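As a very rough sketch of that migration path (DC_VNODES and my_keyspace are placeholders; the authoritative steps are in the DataStax documentation on enabling vnodes for an existing cluster):
# 1. remove the badly bootstrapped node from the ring (run on cs-node2)
nodetool decommission
# 2. build a new datacenter whose nodes have vnodes from the start
#    (in each new node's cassandra.yaml: num_tokens: 256, no initial_token)
# 3. replicate the keyspace to the new datacenter, e.g.
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'Cassandra': 1, 'Solr': 2, 'DC_VNODES': 1};"
# 4. on each new node, stream the existing data from the old DC
nodetool rebuild Cassandra
# 5. once the new DC serves traffic, decommission the old Cassandra DC node(s)
nodetool decommission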