Cassandra: Data loss after adding new node

We had a two-node Cassandra cluster which we wanted to expand to four.
We followed the procedure described here: http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html
But after adding the two new nodes (at the same time, with a two-minute interval as recommended in the documentation), we experienced some data loss: some column families were missing elements.
Here is the nodetool status output:
[centos@ip-10-11-11-187 ~]$ nodetool status
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.11.11.187 5.63 MB 256 ? 0e596912-d649-4eed-82a4-df800c422634 2c
UN 10.11.1.104 748.79 MB 256 ? d8b96739-0858-4926-9eb2-27c96ca0a1c4 2c
UN 10.11.11.24 7.11 MB 256 ? e3e76dcf-2c39-42e5-a34e-9e986d4a9f7c 2c
UN 10.11.1.231 878.91 MB 256 ? cc1b5cfd-c9d0-4ca9-bbb1-bce4b2deffc1 2c
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
I don't quite understand whether that "Note" is bad or not.
When we added the nodes, we listed the first two servers - the ones already in the cluster - as seeds in the configuration of the first added node. For the second added node, we added the first new node to its seed list as well.
We are using EC2Snitch, and the listen_address has been set to the above addresses on each server.
We haven't run cleanup yet, but we tried to run a repair, and it reported that there was nothing to repair in our keyspace.
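For reference, a minimal sketch of the relevant cassandra.yaml settings on the first added node (the addresses are illustrative, taken from the status output above):
# cassandra.yaml (excerpt)
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.11.1.104,10.11.1.231"   # the two original nodes
listen_address: 10.11.11.187                   # this node's own address
endpoint_snitch: Ec2Snitch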
Here is how our cluster was created:
CREATE KEYSPACE keyspace_name WITH replication = {'class': 'NetworkTopologyStrategy', 'us-west-2': '1'} AND durable_writes = true;
And the options of all of our tables:
CREATE TABLE keyspace_name."CFName" (
// ...
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
The data reappears if I decommission the new nodes.
EDIT: It was actually a mistake made when reading the documentation... A colleague had set auto_bootstrap to false instead of leaving it set to true...
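For anyone hitting the same problem, a minimal sketch of the setting in question (it defaults to true when the line is absent from cassandra.yaml):
# cassandra.yaml on a joining node
auto_bootstrap: true   # must be true (or simply omitted) so the new node streams existing data while joining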

You should perform nodetool rebuild on the new nodes after you add them with auto_bootstrap: false
http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsRebuild.html
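For example (a sketch only; the source datacenter name is taken from the nodetool status output above), run on each of the two new nodes:
nodetool rebuild -- us-west-2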
HTH

Well, you can specify the keyspace name to remove the Note, which in this case is:
nodetool status keyspace_name

Related

Repair command #6 failed with error Nothing to repair for (1838038121817533879,1854751957995458118] in xyz - aborting

The nodetool repair command is failing for some of the tables. The Cassandra version is 3.11.6. I have the following questions:
Is this really a problem? What is the impact if we ignore this error?
How can we get rid of this token range for this keyspace?
What is the possible reason that it is complaining about this token range?
Here is the error trace:
[2020-11-12 16:33:46,506] Starting repair command #6 (d5fa7530-2504-11eb-ab07-59621b514775), repairing keyspace solutionkeyspace with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 256, pull repair: false, ignore unreplicated keyspaces: false)
[2020-11-12 16:33:46,507] Repair command #6 failed with error Nothing to repair for (1838038121817533879,1854751957995458118] in solutionkeyspace - aborting
[2020-11-12 16:33:46,507] Repair command #6 finished with error
error: Repair job has failed with the error message: [2020-11-12 16:33:46,507] Repair command #6 failed with error Nothing to repair for (1838038121817533879,1854751957995458118] in solutionkeyspace - aborting
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2020-11-12 16:33:46,507] Repair command #6 failed with error Nothing to repair for (1838038121817533879,1854751957995458118] in solutionkeyspace - aborting
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
Schema definition
CREATE KEYSPACE solutionkeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '1', 'datacenter2': '1'} AND durable_writes = true;
CREATE TABLE solutionkeyspace.schemas (
namespace text PRIMARY KEY,
avpcontainer map<text, text>,
schemacreationcql text,
status text,
version text
) WITH bloom_filter_fp_chance = 0.001
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 10800
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.05
AND speculative_retry = '99PERCENTILE';
nodetool status output
bash-5.0# nodetool -p 7199 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 172.16.0.68 5.71 MiB 256 ? 6a4d7b51-b57b-4918-be2f-3d62653b9509 rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
It looks like you are running Cassandra version 3.11.9+ where this new behaviour was introduced.
https://issues.apache.org/jira/browse/CASSANDRA-15160
You will not see this error if you add the following nodetool repair command-line option:
-iuk
or
--ignore-unreplicated-keyspaces
This change would have broken people's repair scripts when they upgrade from an earlier version of Cassandra to 3.11.9+. We certainly noticed it.
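For example (illustrative only, using the keyspace name from your trace):
nodetool -p 7199 repair --ignore-unreplicated-keyspaces solutionkeyspace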

Scaling in Cassandra

I tested the throughput performance of a Cassandra cluster with 2, 3 and 4 nodes. There was a significant improvement when I went from 2 nodes to 3; however, the improvement wasn't as significant when I used 4 nodes instead of 3.
Given below are specs of the 4 nodes:
N->No. of physical CPU cores, Ra->Total RAM, Rf->Free RAM
Node 1: N=16, Ra=189 GB, Rf=165 GB
Node 2: N=16, Ra=62 GB, Rf=44 GB
Node 3: N=12, Ra=24 GB, Rf=38 GB
Node 4: N=16, Ra=189 GB, Rf=24 GB
All nodes are on RHEL 6.5
Case 1 (2 nodes in the cluster: Node 1 and Node 2)
Throughput: 12K ops/second
Case 2 (3 nodes in the cluster: Node 1, Node 2 and Node 3)
Throughput: 20K ops/second
Case 3 (all 4 nodes in the cluster)
Throughput: 23K ops/second
1 operation involved 1 read + 1 write (the read and write take place on the same row; the row cache can't be used). In all cases, read consistency = 2 and write consistency = 1. Both reads and writes were asynchronous. The client application used DataStax's C++ driver and was run with 10 threads.
Given below are the keyspace and table details:
CREATE KEYSPACE cass WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2'} AND durable_writes = true;
CREATE TABLE cass.test_table (
pk text PRIMARY KEY,
data1_upd int,
id1 int,
portid blob,
im text,
isflag int,
ms text,
data2 int,
rtdata blob,
rtdynamic blob,
rtloc blob,
rttdd blob,
rtaddress blob,
status int,
time bigint
) WITH bloom_filter_fp_chance = 0.001
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Some parameters from the YAML are given below (all 4 nodes used similar YAML files):
commitlog_segment_size_in_mb: 32
concurrent_reads: 64
concurrent_writes: 256
concurrent_counter_writes: 32
memtable_offheap_space_in_mb: 20480
memtable_allocation_type: offheap_objects
memtable_flush_writers: 1
concurrent_compactors: 2
Some parameters from jvm.options are given below (all nodes used the same values):
### CMS Settings
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=4
-XX:MaxTenuringThreshold=6
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=10000
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
-XX:+CMSClassUnloadingEnabled
Given below are some of the client's connection-specific parameters:
cass_cluster_set_max_connections_per_host ( ms_cluster, 20 );
cass_cluster_set_queue_size_io ( ms_cluster, 102400*1024 );
cass_cluster_set_pending_requests_low_water_mark(ms_cluster, 50000);
cass_cluster_set_pending_requests_high_water_mark(ms_cluster, 100000);
cass_cluster_set_write_bytes_low_water_mark(ms_cluster, 100000 * 2024);
cass_cluster_set_write_bytes_high_water_mark(ms_cluster, 100000 * 2024);
cass_cluster_set_max_requests_per_flush(ms_cluster, 10000);
cass_cluster_set_request_timeout ( ms_cluster, 12000 );
cass_cluster_set_connect_timeout (ms_cluster, 60000);
cass_cluster_set_core_connections_per_host(ms_cluster,1);
cass_cluster_set_num_threads_io(ms_cluster,10);
cass_cluster_set_connection_heartbeat_interval(ms_cluster, 60);
cass_cluster_set_connection_idle_timeout(ms_cluster, 120);
Is there anything wrong with the configuration due to which Cassandra didn't scale much when the number of nodes was increased from 3 to 4?
During a test, you can check the thread pools using nodetool tpstats.
You will be able to see if some stages have too many pending (or blocked) tasks.
If there are no issues with the thread pools, maybe you could launch a benchmark using cassandra-stress in order to see whether the limitation comes from your client.
I don't know if it is only for test purposes, but as far as I know, read-before-write is an antipattern with Cassandra.
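For example (a rough sketch only; the node address, operation count and thread count are placeholders to adapt to your setup):
# watch for pending/blocked tasks in the different stages while the test runs
nodetool tpstats
# drive a mixed read/write workload directly against the cluster, bypassing your client
cassandra-stress mixed ratio\(write=1,read=1\) n=1000000 cl=ONE -mode native cql3 -rate threads=100 -node 10.0.0.1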

Can we add a new Datacenter in an existing cluster with a higher Replication Factor?

In Cassandra, adding a new datacenter with a higher replication factor to the same cluster throws an error saying that some range with replication factor 1 is not found in any source datacenter.
I have datacenter X (RF = 2) and datacenter Y (RF = 1). I want to add datacenter Z (RF = 3).
I have added nodes in datacenter Z.
But running
nodetool rebuild -- X
fails with the following error:
java.lang.IllegalStateException: unable to find sufficient sources for streaming range (-3685074324747697686,-3680615207285604279] in keyspace with replication factor 1
Basic details of all column families:
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Your current configuration, as shown in your keyspace definition, is for a single DC. You need to ALTER KEYSPACE it to include the other DC. This starts the replication process: keys that are read/written from then on will be replicated to the new DC. To fully copy all the existing data, you would need (in addition) to use the nodetool rebuild -- DC2 command.
What you need to do is ALTER KEYSPACE and then nodetool rebuild. Copy the text that you get from DESCRIBE KEYSPACE keyspace_name, but without the CREATE at the beginning, and add the new datacenter to the replication settings.
ALTER KEYSPACE keyspace_name WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1, 'datacenter2' : 3 };
Then do:
nodetool rebuild -- datacenter1
Based on your comment -
"I have alter my keyspaces but for the default keyspace like (system_distributed) it throws error"
Other than user-specific keyspaces, make sure that default system keyspaces like "system_distributed" are on "NetworkTopologyStrategy" (no keyspaces on SimpleStrategy for multi-DC, except those using the local strategy).
Reference: Point-2 https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html
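For example (a sketch only; adjust the datacenter names and replication factors to your topology):
ALTER KEYSPACE system_distributed WITH replication = {'class': 'NetworkTopologyStrategy', 'X': 2, 'Y': 1, 'Z': 3};
ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'X': 2, 'Y': 1, 'Z': 3};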

Two Keyspaces on the Same DC

I have already created one keyspace on the DC.
Create query for the tradebees_dev keyspace (this keyspace is working fine):
CREATE KEYSPACE tradebees_dev WITH replication = {'class': 'NetworkTopologyStrategy', 'solr': '3'} AND durable_writes = true;
The status is below:
nodetool status tradebees_dev
Datacenter: Solr
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 1.09 GB 256 100.0% e754d239-8370-4e1d-82c8-dce3d401f972 rack1
UN 127.0.0.2 1.19 GB 256 100.0% 70b74025-1881-4756-a0c8-a1ec5e57d407 rack1
UN 127.0.0.3 1.53 GB 256 100.0% 3ba4bfe4-c894-4cd1-a684-f0f20edac78f rack1
After that I created another keyspace on the same DC with the same replication factor.
Create query for the crawl_dev keyspace:
CREATE KEYSPACE crawl_dev WITH replication = {'class': 'NetworkTopologyStrategy', 'solr': '3'} AND durable_writes = true;
nodetool status crawl_dev
Datacenter: Solr
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 1.09 GB 256 0.0% e754d239-8370-4e1d-82c8-dce3d401f972 rack1
UN 127.0.0.2 1.19 GB 256 0.0% 70b74025-1881-4756-a0c8-a1ec5e57d407 rack1
UN 127.0.0.3 1.53 GB 256 0.0% 3ba4bfe4-c894-4cd1-a684-f0f20edac78f rack1
The first keyspace is working fine, but when I try to run a select query on the second keyspace, i.e. crawl_dev, I get the error message below.
Traceback (most recent call last):
File "/usr/share/dse/resources/cassandra/bin/cqlsh", line 1124, in perform_simple_statement
rows = self.session.execute(statement, trace=self.tracing_enabled)
File "/usr/share/dse/resources/cassandra/bin/../lib/cassandra-driver-internal-only-2.7.2-5d33cb4.zip/cassandra-driver-2.7.2-5d33cb4/cassandra/cluster.py", line 1602, in execute
result = future.result()
File "/usr/share/dse/resources/cassandra/bin/../lib/cassandra-driver-internal-only-2.7.2-5d33cb4.zip/cassandra-driver-2.7.2-5d33cb4/cassandra/cluster.py", line 3347, in result
raise self._final_exception
Unavailable: code=1000 [Unavailable exception] message="Cannot achieve consistency level ONE" info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 'ONE'}
Please suggest how to resolve this issue, and also let me know whether we can create two keyspaces on the same DC - YES or NO.
After some research, I found some information and then checked
/etc/dse/cassandra/cassandra-rackdc.properties
In this file, dc=DC1 and rc=RACK1 are set.
Thanks.
The datacenter name in the "create keyspace" command is case-sensitive, so instead of:
CREATE KEYSPACE tradebees_dev WITH replication = {'class': 'NetworkTopologyStrategy', 'solr': '3'} AND durable_writes = true;
you want to capitalize Solr, for example:
CREATE KEYSPACE tradebees_dev WITH replication = {'class': 'NetworkTopologyStrategy', 'Solr': '3'} AND durable_writes = true;
You're on the right track for troubleshooting with your "nodetool status [keyspace]" commands. Notice that in your result for tradebees_dev, each node reports 100% in the Owns column, which makes sense because you have RF 3 on a 3-node cluster. Then notice that for crawl_dev it shows 0%, which means no node owns that data, hence the error you received. In your example above I suspect you did create tradebees_dev with a capital "Solr" in the replication settings and that's why it worked.
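If that is what happened, one possible fix for crawl_dev (a sketch only, assuming the datacenter really is named Solr as nodetool reports) is:
ALTER KEYSPACE crawl_dev WITH replication = {'class': 'NetworkTopologyStrategy', 'Solr': '3'};
followed by nodetool repair crawl_dev so that any existing data is streamed to its new replicas.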
I don't see why you should not be able to create multiple keyspaces on the same DC. Indeed, you already have multiple keyspaces in the cluster:
cqlsh> DESCRIBE keyspaces;
system_traces system_schema system_auth system system_distributed

Cassandra: Fetching data but not populating cache as query does not query from the start of the partition

I'm running Cassandra version 3.3 on a fairly beefy machine. I would like to try out the row cache, so I have allocated 2 GB of RAM for row caching and configured the target tables to cache a number of their rows.
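For reference, a minimal sketch of the relevant global setting (the per-table caching option appears in the schema below; the value mirrors the 2 GB mentioned above):
# cassandra.yaml (excerpt)
row_cache_size_in_mb: 2048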
If I run a query on a very small table (under 1 MB) twice with tracing on, on the 2nd query I see a cache hit. However, when I run a query on a large table (34 GB) I only get cache misses and see this message after every cache miss:
Fetching data but not populating cache as query does not query from the start of the partition
What does this mean? Do I need a bigger row cache to be able to handle a 34 GB table with 90 million keys?
Taking a look at the row cache source code on github, I see that clusteringIndexFilter().isHeadFilter() must be evaluating to false in this case. Is this a function of my partitions being too big?
My schema is:
CREATE TABLE ap.account (
email text PRIMARY KEY,
added_at timestamp,
data map< int, int >
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': '100000'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
The query is simply SELECT * FROM account WHERE email='sample@test.com'
