cassandra-reaper: repairs repeatedly postponed and stuck - cassandra

Cassandra-Reaper v2.0.3
Cassandra v3.11.5.1
Every day I run a repair on a single keyspace, and since a few weeks ago the repairs never finish.
The following is the information table taken from Reaper's dashboard:
ID | 00000000-0000-0177-0000-000000000000
-- | --
Owner | g
Cause | g
Last event | postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
Start time | March 9, 2020 10:45 AM
End time |  
Pause time |  
Duration | 22 hours 17 minutes 10 seconds
Segment count | 136
Segments repaired | 67
Intensity | 0.8999999761581421
Repair parallelism | PARALLEL
Incremental repair | false
Repair threads | 1
Nodes |  
Datacenters | DC1
Blacklist |  
Creation time | March 9, 2020 10:45 AM
Available metrics (can require a full run before appearing) | io.cassandrareaper.service.RepairRunner.repairProgress.mycluster.mkphistory.00000000000000070000000000000000, io.cassandrareaper.service.RepairRunner.segmentsDone.mycluster.mkphistory.00000000000000070000000000000000, io.cassandrareaper.service.RepairRunner.segmentsTotal.mycluster.mkphistory.00000000000000070000000000000000, io.cassandrareaper.service.RepairRunner.millisSinceLastRepair.mycluster.mkphistory.00000000000000070000000000000000
I also noticed the very same message repeated endlessly in Reaper's log:
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
(the same line is repeated indefinitely)
A few weeks ago this repair took just a couple of hours when launched with 4 threads. I tried decreasing the number of threads used for the repair, but the result hasn't changed and the repair is still stuck.
I also tried a rolling restart (restarting Reaper as well) without success.
Do you have any idea what could cause this behavior?

Related

Reaper failed to run repair on Cassandra nodes

After Reaper failed to run repairs on the 18 nodes of our Cassandra cluster, I ran a full repair of each node to fix the failed repairs. After the full repair, Reaper executed successfully, but after a few days Reaper failed to run again. I can see the following error in system.log:
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
In nodetool tpstats I can see some pending tasks:
Pool Name Active Pending
ReadStage 0 0
Repair#18 3 90
ValidationExecutor 3 3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is: why is Reaper still failing even after a full repair? And what is the root cause of the pending repairs?
PS: the Reaper version is 2.2.3; I'm not sure if this is a bug in Reaper.
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rules:
If fewer than 20% of segments are failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml (a sketch follows after these rules).
If you have more than 20% of segments failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
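For reference, a minimal sketch of the timeout change, assuming a package install where Reaper's config lives at /etc/cassandra-reaper/cassandra-reaper.yaml and Reaper runs as the cassandra-reaper service (both are assumptions; adjust for your installation):
grep -n 'hangingRepairTimeoutMins' /etc/cassandra-reaper/cassandra-reaper.yaml   # default is 30
# edit the file so the line reads, for example:
#   hangingRepairTimeoutMins: 60
sudo systemctl restart cassandra-reaper   # restart Reaper so the new timeout takes effect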
Assuming you're not running Cassandra 4.0 yet: now that you have run repair through nodetool, you have sstables which are marked as repaired, just as incremental repair would do. This creates a problem, because Reaper's repairs don't mark sstables as repaired, and you now have two different sstable pools (repaired and unrepaired) which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired and put them back in the same pool. Please read the documentation to learn how to achieve this.
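A minimal sketch of that procedure, assuming a Cassandra 3.11 package install; the service name, data path, keyspace, and table below are placeholders, and the node must be down while the offline tool runs:
nodetool drain                      # flush memtables and stop accepting writes
sudo systemctl stop cassandra       # offline sstable tools require a stopped node
sstablerepairedset --really-set --is-unrepaired /var/lib/cassandra/data/my_keyspace/my_table-*/*-big-Data.db
sudo systemctl start cassandra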
There could be a number of things at play, such as Reaper being unable to connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem from the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!

dsbulk unload missing data

I'm using dsbulk 1.6.0 to unload data from Cassandra 3.11.3.
Each unload results in wildly different row counts. Here are the results from 3 invocations of unload, on the same cluster, connecting to the same Cassandra host. The table being unloaded is append-only, and data is never deleted, so a decrease in unloaded rows should not occur. There are 3 Cassandra nodes in the cluster and a replication factor of 3, so all data should be present on the chosen host. Furthermore, these runs were executed in quick succession; the number of newly added rows would be in the hundreds (if there were any), not in the tens of thousands.
Run 1:
total | failed | rows/s | p50ms | p99ms | p999ms
10,937 | 7 | 97 | 15,935.46 | 20,937.97 | 20,937.97
Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.
Run 2:
total | failed | rows/s | p50ms | p99ms | p999ms
60,558 | 3 | 266 | 12,551.34 | 21,609.05 | 21,609.05
Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.
Run 3:
total | failed | rows/s | p50ms | p99ms | p999ms
45,404 | 4 | 211 | 16,664.92 | 30,870.08 | 30,870.08
Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.
It would appear that Run 1 is missing the majority of the data. Run 2 may be closer to complete and Run 3 is missing significant data.
I'm invoking unload as follows:
dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE > $DATA_FILE
I'm assuming this isn't expected behaviour for dsbulk. How do I configure it to reliably unload a complete table without errors?
Data could be missing from a host if the host wasn't reachable when the data was written, hints weren't replayed, and you don't run repairs periodically. And because DSBulk reads by default with consistency level LOCAL_ONE, different hosts will provide different views (the host you specify is just a contact point; after that, the cluster topology is discovered and DSBulk selects replicas based on the load balancing policy).
You can force DSBulk to read the data at another consistency level by using the -cl command line option (see the DSBulk documentation). You can compare results using LOCAL_QUORUM or ALL; at these levels Cassandra will also "fix" the inconsistencies as they are discovered, although this will be much slower and will add load to the nodes because of the writes for the repaired data.
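For example, a variation of the command from the question that re-runs the unload at LOCAL_QUORUM (a sketch; same placeholders as above):
dsbulk unload -h "$CASSANDRA_IP" -k "$KEYSPACE" -t "$CASSANDRA_TABLE" -cl LOCAL_QUORUM > "$DATA_FILE"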

Frequent Spikes in Cassandra write latency

In our production cluster, the cluster write latency frequently spikes from 7 ms to 4 s. Because of this, clients face a lot of read and write timeouts. This repeats every few hours.
Observations:
Cluster write latency (99th percentile) - 4 s
Local write latency (99th percentile) - 10 ms
Read & write consistency - LOCAL_ONE
Total nodes - 7
I enabled tracing with settraceprobability for a few minutes and observed that most of the time is spent in internode communication; the commands used and a trace excerpt follow:
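For reference, tracing was toggled roughly like this (the probability value is illustrative; keep it low on a busy cluster and remember to turn it off):
nodetool settraceprobability 0.001   # sample ~0.1% of requests
nodetool settraceprobability 0       # disable tracing again when done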
session_id | event_id | activity | source | source_elapsed | thread
--------------------------------------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+---------------+----------------+------------------------------------------
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c3314-bb79-11e8-aeca-439c84a4762c | Parsing SELECT * FROM table1 WHERE uaid = '506a5f3b' AND messageid >= '01;' | cassandranode3 | 7 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a20-bb79-11e8-aeca-439c84a4762c | Preparing statement | Cassandranode3 | 47 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a21-bb79-11e8-aeca-439c84a4762c | reading data from /Cassandranode1 | Cassandranode3 | 121 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38610-bb79-11e8-aeca-439c84a4762c | REQUEST_RESPONSE message received from /cassandranode1 | cassandranode3 | 40614 | MessagingService-Incoming-/Cassandranode1
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38611-bb79-11e8-aeca-439c84a4762c | Processing response from /Cassandranode1 | Cassandranode3 | 40626 | SharedPool-Worker-5
I tried checking the connectivity between Cassandra nodes but did not see any issues. Cassandra logs are flooded with Read timeout exceptions as this is a pretty busy cluster with 30k reads/sec and 10k writes/sec.
Warning in the system.log:
WARN [SharedPool-Worker-28] 2018-09-19 01:39:16,999 SliceQueryFilter.java:320 - Read 122 live and 266 tombstone cells in system.schema_columns for key: system (see tombstone_warn_threshold). 2147483593 columns were requested, slices=[-]
During the spike the cluster just stalls, and even simple commands like "use system_traces" fail.
cassandra#cqlsh:system_traces> select * from sessions ;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.
I validated the schema versions on all nodes and they are the same, but it looks like during the issue Cassandra is not even able to read its own metadata.
Has anyone faced similar issues? Any suggestions?
(From the data in your comments above.) Long full GC pauses can definitely cause this. Add -XX:+DisableExplicitGC: you are getting full GCs because of calls to System.gc(), which most likely come from the periodic RMI DGC (distributed garbage collection) task that runs at regular intervals regardless of whether it is needed. With a large heap that is VERY expensive, and it is safe to disable.
Check your GC log header and make sure the min heap size is not set. I would also recommend setting -XX:G1ReservePercent=20.
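A minimal sketch of those JVM changes, assuming they are added through cassandra-env.sh (on some 3.x installs GC flags go in jvm.options instead; file locations are assumptions):
# cassandra-env.sh
JVM_OPTS="$JVM_OPTS -XX:+DisableExplicitGC"   # ignore System.gc() calls, e.g. from the RMI DGC
JVM_OPTS="$JVM_OPTS -XX:G1ReservePercent=20"  # keep extra G1 headroom
# restart the node for the flags to take effect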

Trigger compactions when TWCS TTL changes in Cassandra 3.0

I have a table in Cassandra where I save data using a client-side TTL of one month (the table's default TTL is 0). The table is configured with the time window compaction strategy (TWCS).
Every day Cassandra cleaned up a single sstable containing the expired data from one month ago.
Recently I changed the client-side TTL to 15 days. I was expecting Cassandra to start cleaning two sstables a day at some point and release the space, but it keeps cleaning one sstable a day and holding on to 15 days of dead data.
How do I know?
for f in /data/cassandra/data/keyspace/table-*/*Data.db; do meta=$(sudo sstablemetadata $f); echo -e "Max:" $(date --date=@$(echo "$meta" | grep Maximum\ time | cut -d" " -f3| cut -c 1-10) '+%m/%d/%Y') "Min:" $(date --date=@$(echo "$meta" | grep Minimum\ time | cut -d" " -f3| cut -c 1-10) '+%m/%d/%Y') $(echo "$meta" | grep droppable) ' \t ' $(ls -lh $f | awk '{print $5" "$6" "$7" "$8" "$9}'); done | sort
This command lists all the sstables with their min/max data timestamps and estimated droppable tombstones:
Max: 05/19/2018 Min: 05/18/2018 Estimated droppable tombstones: 0.9876591095477787 84G May 21 02:59 /data/cassandra/data/pcc/data_history-c46a3220980211e7991e7d12377f9342/mc-218473-big-Data.db
Max: 05/20/2018 Min: 05/19/2018 Estimated droppable tombstones: 0.9875830312750179 84G May 22 15:25 /data/cassandra/data/pcc/data_history-c46a3220980211e7991e7d12377f9342/mc-221915-big-Data.db
Max: 05/21/2018 Min: 05/20/2018 Estimated droppable tombstones: 0.9876636061230402 85G May 23 13:56 /data/cassandra/data/pcc/data_history-c46a3220980211e7991e7d12377f9342/mc-224302-big-Data.db
...
For now I have been triggering the compactions manually over JMX, but I want the expired data removed automatically, as it normally would be:
run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction /data/cassandra/data/keyspace/sstable_path
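For reference, one way to issue that MBean operation from a shell is via jmxterm, a third-party JMX CLI (the jar name, JMX port, and sstable path below are placeholders):
echo "run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction /data/cassandra/data/keyspace/sstable_path" | \
  java -jar jmxterm-uber.jar -l localhost:7199 -n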
I think I figured it out. I had to run a manual compaction on the oldest and the newest sstables whose content had fully expired, on both sstables at the same time.
After a couple of days it cleaned everything up.
How do I know it was working? Because when I tried to run forceUserDefinedCompaction on any other sstable in between, it always returned null.
EDIT:
It didn't work; the sstable count keeps increasing again.
EDIT:
Using sstableexpiredblockers pointed me to the sstables that were blocking the rest of the compactions. After compacting those manually, Cassandra automatically compacted the rest.
On one node out of 8 the blocking sstable wasn't unblocked after compacting, so a "nodetool scrub" did the job (which scrubs all the sstables).
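For reference, the commands involved look roughly like this (keyspace and table names are placeholders):
sstableexpiredblockers my_keyspace my_table   # shows which sstables block others from being dropped
nodetool scrub my_keyspace my_table           # rewrites the sstables for that table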

PostgreSQL-BDR: some nodes start replicating only 2 hours after network problems

My setup is PostgreSQL-BDR on 4 servers with the same configuration.
After network problems (e.g. the connection being lost for a few minutes), some of the nodes start replicating again within seconds, but other nodes start replicating only after 2 hours.
I couldn't find any configuration switch that controls this timing.
I see the following when monitoring the replication slots:
slot_name | database | active | retained_bytes
bdr_16385_6255603470654648304_1_16385__ | mvcn | t | 56
bdr_16385_6255603530602290326_1_16385__ | mvcn | f | 17640
bdr_16385_6255603501002479656_1_16385__ | mvcn | f | 17640
Any idea why this is happening?
The problem was that the default tcp_keepalive_time is 7200 seconds, which is exactly 2 hours, so changing the value of /proc/sys/net/ipv4/tcp_keepalive_time solved the problem.
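A minimal sketch of that change (the keepalive value is illustrative and the persistence file path may vary by distro):
sudo sysctl -w net.ipv4.tcp_keepalive_time=600                                   # apply at runtime
echo 'net.ipv4.tcp_keepalive_time = 600' | sudo tee /etc/sysctl.d/99-keepalive.conf   # persist across reboots
sudo sysctl --system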
