Unable to add a new node to a Cassandra 1.2 ring because streaming times out. We have streaming_socket_timeout_in_ms=30000. What should I change it to? Why is streaming not retried ad infinitum until successful?
In the end it worked, after a dozen attempts. What made the difference:
Raise streaming_socket_timeout_in_ms to 600000 (10 min), so it is longer than the longest imaginable GC pause. (Maybe 0, which disables the socket timeout entirely, would also have worked.)
Before starting the bootstrap of the new node, do a rolling restart of the nodes in the same datacenter as the new node, so that their heaps start fresh from a low value.
Start the bootstrap at the end of the work day and leave it running overnight.
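For reference, a minimal sketch of the change in cassandra.yaml on each node (the value shown is the one that worked here; a restart is required for it to take effect):

# cassandra.yaml
# Raise the streaming socket timeout well above any plausible GC pause.
# 0 disables the socket timeout altogether.
streaming_socket_timeout_in_ms: 600000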
After Reaper failed to run a repair on the 18 nodes of our Cassandra cluster, I ran a full repair of each node to fix the failed repair issue. After the full repair, Reaper executed successfully, but after a few days Reaper failed to run again. I can see the following error in system.log:
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
In nodetool tpstats I can see some pending tasks:
Pool Name            Active   Pending
ReadStage                 0         0
Repair#18                 3        90
ValidationExecutor        3         3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is: why is Reaper still failing even after a full repair? And what is the root cause of the pending repair tasks?
PS: the Reaper version is 2.2.3; I'm not sure whether it is a bug in Reaper!
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rules:
If you have fewer than 20% of segments failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml (see the config sketch below).
If you have more than 20% of segments failing, double the number of segments.
If you have more than 20% of segments failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
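As a sketch, the relevant settings live in Reaper's YAML config (segmentCountPerNode is an assumption about how your segments are defined; segments can also be set per repair run in the UI):

# cassandra-reaper.yaml (values are illustrative, not recommendations)
hangingRepairTimeoutMins: 60    # double the 30 min default if fewer than 20% of segments fail
segmentCountPerNode: 128        # double your current value if more than 20% of segments fail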
Assuming you're not running Cassandra 4.0 yet, now that you have run repair through nodetool, you have sstables which are marked as repaired, just as incremental repair would mark them. This will create a problem as Reaper's repairs don't mark sstables as repaired, and you now have two different sstable pools (repaired and unrepaired) which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired to put all sstables back in the same pool. Please read the documentation to learn how to achieve this.
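As a rough sketch of what that looks like (paths are illustrative, and the tool should be run with Cassandra stopped on the node; check the documentation for your exact version):

# with the node stopped, list the sstables of the affected keyspace...
find /var/lib/cassandra/data/my_keyspace -name "*-Data.db" > sstables.txt
# ...and mark them all as unrepaired
sstablerepairedset --really-set --is-unrepaired -f sstables.txt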
There could be a number of things taking place, such as Reaper being unable to connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!
I found that a query (just a simple SELECT) takes a long time when adding a new node to the cluster.
My execution time log:
17:49:40.008 [ThreadPoolTaskScheduler-14] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 8 ms
17:50:00.010 [ThreadPoolTaskScheduler-3] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 15010 ms
17:50:15.008 [ThreadPoolTaskScheduler-4] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 10008 ms
17:50:20.008 [ThreadPoolTaskScheduler-16] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 7 ms
Normally it takes about 10 ms, but it suddenly takes 15000 ms when adding a node.
And I found it gets stuck waiting for the new node to initialize its data.
Cassandra log (the new node):
INFO [HANDSHAKE-/194.187.1.52] 2019-05-31 17:49:36,056 OutboundTcpConnection.java:560 - Handshaking version with /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,059 Gossiper.java:1055 - Node /194.187.1.52 is now part of the cluster
INFO [RequestResponseStage-1] 2019-05-31 17:49:36,069 Gossiper.java:1019 - InetAddress /194.187.1.52 is now UP
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [MigrationStage:1] 2019-05-31 17:49:39,347 ViewManager.java:137 - Not submitting build tasks for views in keyspace system_traces as storage service is not initialized
INFO [MigrationStage:1] 2019-05-31 17:49:39,352 ColumnFamilyStore.java:411 - Initializing system_traces.events
INFO [MigrationStage:1] 2019-05-31 17:49:39,382 ColumnFamilyStore.java:411 - Initializing system_traces.sessions
It gets stuck at: Node /194.187.1.52 is now part of the cluster
And the client waits for the new node to initialize all of its data.
What I have tried:
1. I tried consistency level ONE and QUORUM; there is no difference.
2. I tried setting the replication factor to 1, 2 or 3; still no difference.
Why does the new node become part of the cluster when it has not completely initialized its data?
Is there a way to solve this?
I expect that when I query an old node, the performance is not affected by waiting for the new node to initialize its data.
I resolved this problem.
I had a wrong config: I had made every node a seed, even before it joined the cluster, and this caused the read timeouts while adding a new node to the cluster.
After fixing this, all reads were normal, but somehow I found that insert queries timed out while adding a node.
Finally I tuned the following to avoid the insert timeouts:
/sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
and also changed the conf to limit streaming throughput:
stream_throughput_outbound_megabits_per_sec: 100
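In case it helps others, a sketch of how to make the same tuning persistent and apply the throughput cap without a restart (standard Linux sysctl and nodetool; adjust values to your environment):

# append to /etc/sysctl.conf so the keepalive tuning survives reboots, then reload with `sysctl -p`
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=5

# cap streaming throughput at runtime (same unit as the yaml setting, megabits/s)
nodetool setstreamthroughput 100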
Thanks a lot for the help.
This is just a theory, but one possible cause is that the new node is being chosen as a coordinator by your driver client. In that case, consistency level and replication factor aren't the main contributors to the delay in servicing your query.
If the new node is performing slowly at first for whatever reason and the driver is sending requests to it, the behavior of that coordinator can impact the servicing of your request.
What exactly is runJob doing? You suggested it is making a single query, but is it possible that it's a range query?
If it's a single query and it's taking as long as 10 seconds, that seems odd as the default read_request_timeout is 5 seconds. If it's a range query (a read involving multiple partitions), the default is 10 seconds. Are you adjusting those timeouts?
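For reference, a sketch of the cassandra.yaml settings in question, with their defaults (not a recommendation to change them):

# cassandra.yaml defaults
read_request_timeout_in_ms: 5000     # single-partition reads
range_request_timeout_in_ms: 10000   # range reads spanning multiple partitions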
When you see responses that long for a single query, it could mean the coordinator is what is impeding responsiveness; otherwise, if the coordinator were responsive and only the replicas were slow, you'd see a ReadTimeoutException returned to the client.
To better react to these cases, a number of client drivers implement a strategy called 'speculative execution'. As described in the documentation for the DataStax Java Driver for Apache Cassandra:
Sometimes a Cassandra node might be experiencing difficulties (ex: long GC pause) and take longer than usual to reply. Queries sent to that node will experience bad latency.
One thing we can do to improve that is pre-emptively start a second execution of the query against another node, before the first node has replied or errored out. If that second node replies faster, we can send the response back to the client (we also cancel the first execution – note that “cancelling” in this context simply means discarding the response when it arrives later, Cassandra does not support cancellation of in flight requests at this stage)
You could configure your driver to speculatively execute with a constant threshold for idempotent requests (such as reads). In the 3.x Java driver, it's done this way:
Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .withSpeculativeExecutionPolicy(
        new ConstantSpeculativeExecutionPolicy(
            500, // delay before a new execution is launched
            2    // maximum number of executions
        ))
    .build();
In this case, if the coordinator is slow to respond, after 500 ms the driver chooses another coordinator and submits a second request, and the first coordinator to respond wins.
Note that this might cause an amplification of requests sent to your cluster overall, so you want to tune that delay in such a way that it only kicks in when response time is highly anomalous. In your case, if requests normally take less than 10ms, 500ms is probably a reasonable number depending on what your higher percentile latencies look like.
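One caveat: in the 3.x driver, speculative executions are only attempted for statements the driver knows to be idempotent, so you may need to mark your reads explicitly. A minimal sketch (table name, bind value, and session are placeholders):

// mark the read as idempotent so the speculative execution policy applies to it
Statement stmt = new SimpleStatement("SELECT * FROM my_ks.my_table WHERE id = ?", id)
    .setIdempotent(true);
session.execute(stmt);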
All that being said, if you are able to identify that the problem is the new node behaving poorly as a coordinator, it's worth understanding why. Adding speculative execution could be a nice way of working around the problem, but it's probably better to try to understand why the new node performs so slowly. Having monitoring in place to observe Cassandra's metrics would likely give great visibility into the problem.
That is a behaviour you can see with too high a consistency level, or with not enough copies of the data (replication factor). When a new node is added to the cluster, a rearrangement of token ownership occurs; once it is determined which data the new node will own, it starts streaming that data, which may saturate the network.
In your question you don't mention the network setup or whether you are using cloud instances, both of which have a direct impact on these constraints; for example, an AWS m3.large instance is more restricted in network capabilities than an i3.4xlarge.
Another variable to consider is the disk configuration. If you are using your own hardware, look for the IO cap of your drives; if you are in the cloud, using instance storage, when available, will perform better than external volumes (such as AWS EBS; in that case, ensure that you enable the "EBS optimized" option if your instance allows it).
Usually an RF of 3 with a consistency level of QUORUM should also help you prevent the issue.
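As a sketch of what that looks like (keyspace and datacenter names are placeholders), the replication factor is set per keyspace, and existing data then needs a repair to gain the extra replicas:

ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
-- then run `nodetool repair my_keyspace` on each node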
I'm currently exploring CouchDB replication and trying to figure out the difference between the max_replication_retry_count and retries_per_request configuration options in the [replicator] section of the configuration file.
Basically I want to configure continuous replication of a local CouchDB to a remote instance that never stops retrying, allowing for potentially long periods of being offline (days or even weeks). So I'd like infinite replication attempts with a maximum retry interval of 5 minutes or so. Can I do this? Do I need to change the default configuration to achieve this?
Here are the replies I got on the CouchDB mailing lists:
If we are talking Couch 1.6, the attribute retries_per_request controls the number of attempts a current replication will make to read the _changes feed before giving up. The attribute max_replication_retry_count controls the number of times the whole replication job will be retried by the replication manager. Setting this attribute to "infinity" should make the replication manager never give up.
I don't think the interval between those attempts is configurable. As far as I understand, it starts at 2.5 seconds between retries and then doubles until it reaches 10 minutes, which is a hard upper limit.
Extended answer:
The answer is slightly different depending on whether you're using the 1.x/2.0 releases or current master.
If you're using a 1.x or 2.0 release: set "max_replication_retry_count = infinity" so it will always retry failed replications. That setting controls how the whole replication job restarts if there is any error. Then "retries_per_request" can be used to handle errors for individual replicator HTTP requests, basically the case where a quick immediate retry succeeds. The default value for "retries_per_request" is 10. After the first failure there is a 0.25 second wait; on the next failure it doubles to 0.5 seconds, and so on, with a maximum wait interval of 5 minutes. But if you expect to be offline routinely, maybe it's not worth retrying individual requests for too long, so reduce "retries_per_request" to 6 or 7. Individual requests would then retry a few times for about 10 - 20 seconds, after which the whole replication job will crash and retry.
If you're using current master, which has the new scheduling replicator: there is no need to set "max_replication_retry_count"; that setting is gone and all replication jobs will always retry for as long as the replication document exists. But "retries_per_request" works the same as above. The replication scheduler also does exponential backoff when replication jobs fail consecutively. The first backoff is 30 seconds; then it doubles to 1 minute, 2 minutes, and so on, with a maximum backoff of about 8 hours. But if you don't want to wait 4 hours on average for the replication to restart when network connectivity is restored, and would rather it be about 5 minutes or so, set "max_history = 8" in the "replicator" config section. max_history controls how much history of past events is retained for each replication job; if there is less history of consecutive crashes, the backoff wait interval will also be shorter.
So to summarize, for 1.x/2.0 releases:
[replicator]
max_replication_retry_count = infinity
retries_per_request = 6
For current master:
[replicator]
max_history = 8
retries_per_request = 6
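For completeness, the continuous replication itself is typically defined by a document in the _replicator database; a minimal sketch, with placeholder names and URLs:

{
  "_id": "local-to-remote",
  "source": "http://localhost:5984/mydb",
  "target": "https://user:pass@remote.example.com:6984/mydb",
  "continuous": true
}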
I'm trying to add a new node to our cluster (Cassandra 2.1.11, 16 nodes, 32 GB RAM, 2x3 TB HDD, 8-core CPU, 1 datacenter, 2 racks, about 700 GB of data on each node). After the new node started, the data (approx. 600 GB total) from the 16 existing nodes was successfully transferred to it, and the building of secondary indexes started. The secondary index build process looks normal; I see info about the successful completion of some secondary index builds and some stream tasks:
INFO [StreamReceiveTask:9] 2015-11-22 02:15:23,153 StreamResultFuture.java:180 - [Stream #856adc90-8ddd-11e5-a4be-69bddd44a709] Session with /192.168.21.66 is complete
INFO [StreamReceiveTask:9] 2015-11-22 02:15:23,152 SecondaryIndexManager.java:174 - Index build of [docs.docs_ex_pl_ph_idx, docs.docs_lo_pl_ph_idx, docs.docs_author_login_idx, docs.docs_author_extid_idx, docs.docs_url_idx] complete
Currently 9 out of 16 streams have successfully finished, according to the logs. Everything looks fine except one issue: this process has already lasted 5 full days. There are no errors in the logs, nothing suspicious, except the extremely slow progress.
nodetool compactionstats -H
shows
Secondary index build ... docs 882,4 MB 1,69 GB bytes 51,14%
So the index build is running and making some progress, but very slowly: 1% in half an hour or so.
The only significant difference between the new node and any of the existing nodes is that the Cassandra Java process has 21k open files, in contrast to 300 open files on any existing node, and there are 80k files in the data dir on the new node in contrast to 300-500 files in the data dir on any existing node.
Is this normal? At this speed it looks like I'll spend 16 weeks or so adding 16 more nodes.
I know this is an old question, but we ran into this exact issue with 2.1.13 using DTCS. We were able to fix it in our test environment by increasing memtable flush thresholds to 0.7 - which didn't make any sense to us, but may be worth trying.
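If you want to try it, the knob in question is most likely memtable_cleanup_threshold in cassandra.yaml (the exact setting name is an assumption; verify against the docs for your version):

# cassandra.yaml (assumed setting; requires a node restart)
memtable_cleanup_threshold: 0.7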
I'm trying to simultaneously add 4 nodes to my current 2-node DC. I have vnodes turned off, as per the DataStax suggestion. Right after the major index build on each node, the following warning is printed several times in the logs:
WARN [SolrSecondaryIndex ks.cf index initializer.] 2014-06-20 09:39:59,904 CassandraUtil.java (line 108) Error Operation timed out - received only 3 responses. on attempt 1 out of 4 with CL QUORUM...
I understand what it means. But why is Cassandra expecting the nodes to fulfill the CL when these nodes are still bootstrapping? More importantly, how does the warning affect the bootstrap? I noticed that the nodes are not doing any index build or streaming anymore, but they also remain in the "Active - Joining" state. Is there any chance that they will finish? What should I do?
I'm using DSE 4.0.3. All existing and new nodes in the DC are Search nodes. I pre-computed the tokens using the Python program for the Murmur3Partitioner.
EDIT:
Although nodetool compactionstats does not show any ongoing index build on the nodes, for some reason I still see a lot of these lines in the logs:
INFO [IndexPool backpressure thread-0] 2014-06-20 12:30:31,346 IndexPool.java (line 472) Throttling at 26 index requests per second with target total queue size at 40
INFO [IndexPool backpressure thread-0] 2014-06-20 12:30:34,169 IndexPool.java (line 428) Back pressure is active with total index queue size 18586 and average processing time 2770
EDIT:
Interestingly, I found the following lines in each node after digging through the log files:
INFO [main] 2014-06-20 09:39:48,588 StorageService.java (line 1036) Bootstrap completed! for the tokens [node token]
INFO [SolrSecondaryIndex ks.cf index initializer.] 2014-06-20 11:32:07,833 AbstractSolrSecondaryIndex.java (line 411) Reindexing 1417116631 commit log updates for core ks.cf
Based on these lines, I feel a lot safer that the bootstrap actually completed and that the nodes are simply re-indexing their data. I don't know, though, why the re-indexing process is not shown in nodetool compactionstats.
It appears the bootstrap completed, and the DSE Search system is running normally.
why the re-indexing process is not being shown in nodetool compactionstats
DSE Search is not generally exposed via the Cassandra command line tools. The log output should show the indexing as having completed; were you able to verify that?