I found that a query (just a simple SELECT) takes a long time when a new node is being added to the cluster.
My execution time log:
17:49:40.008 [ThreadPoolTaskScheduler-14] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 8 ms
17:50:00.010 [ThreadPoolTaskScheduler-3] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 15010 ms
17:50:15.008 [ThreadPoolTaskScheduler-4] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 10008 ms
17:50:20.008 [ThreadPoolTaskScheduler-16] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 7 ms
Normally it takes about 10 ms, but it suddenly takes around 15000 ms while a node is being added.
And I found it gets stuck because it is waiting for the new node to initialize its data.
Cassandra log (the new node):
INFO [HANDSHAKE-/194.187.1.52] 2019-05-31 17:49:36,056 OutboundTcpConnection.java:560 - Handshaking version with /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,059 Gossiper.java:1055 - Node /194.187.1.52 is now part of the cluster
INFO [RequestResponseStage-1] 2019-05-31 17:49:36,069 Gossiper.java:1019 - InetAddress /194.187.1.52 is now UP
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [MigrationStage:1] 2019-05-31 17:49:39,347 ViewManager.java:137 - Not submitting build tasks for views in keyspace system_traces as storage service is not initialized
INFO [MigrationStage:1] 2019-05-31 17:49:39,352 ColumnFamilyStore.java:411 - Initializing system_traces.events
INFO [MigrationStage:1] 2019-05-31 17:49:39,382 ColumnFamilyStore.java:411 - Initializing system_traces.sessions
It gets stuck at: Node /194.187.1.52 is now part of the cluster
And the client then waits for the new node to initialize all of its data.
What I have tried:
1. I tried consistency level ONE and QUORUM (set per statement, as in the sketch below), and there is no difference
2. I tried replication factors of 1, 2 and 3, and there is still no difference
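For reference, this is roughly how the consistency level was set per statement. A minimal sketch assuming the DataStax 3.x Java driver; the contact point, keyspace, table and host value are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
Session session = cluster.connect("my_keyspace");

// Same simple SELECT, tried with ConsistencyLevel.ONE and ConsistencyLevel.QUORUM
SimpleStatement stmt = new SimpleStatement("SELECT * FROM disk_usage WHERE host = ?", "host-1");
stmt.setConsistencyLevel(ConsistencyLevel.ONE);
ResultSet rs = session.execute(stmt);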
Why does the new node become part of the cluster before it has finished initializing its data?
Is there a way to solve this?
I expect that when I query an old node, performance is not affected by waiting for the new node to initialize its data.
.
.
.
I resolved this problem.
I had a wrong config: I let every node be a seed, even before it joined the cluster, and this caused the read timeouts while adding a new node to the cluster.
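For illustration, the seed list in cassandra.yaml should contain only a couple of stable, already-joined nodes, and a brand-new node should not list itself as a seed before it has bootstrapped. A minimal sketch with placeholder addresses:

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1,10.0.0.2"  # existing nodes only, not the joining node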
After fixing this, all reads are normal, but I still found that insert queries timed out while adding a node.
Finally I tuned the following to avoid the insert timeouts:
/sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
and also changed the config to limit streaming throughput:
stream_throughput_outbound_megabits_per_sec: 100
Thanks a lot for the help.
This is just a theory, but one possible cause is that the new node is being chosen as a coordinator by your driver client. In that case, consistency level and replication factor aren't the main contributing factors to the delay in servicing your query.
If the new node is performing slowly at first for whatever reason and the driver is sending requests to it, the behavior of that coordinator can impact the servicing of your request.
What exactly is runJob doing? You suggested it is making a single query, but is it possible that it's a range query?
If it's a single query and it's taking as long as 10 seconds, that seems odd as the default read_request_timeout is 5 seconds. If it's a range query (a read involving multiple partitions), the default is 10 seconds. Are you adjusting those timeouts?
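For reference, a sketch of the relevant cassandra.yaml settings with their defaults (worth checking whether they've been overridden on your nodes):

read_request_timeout_in_ms: 5000     # single-partition reads
range_request_timeout_in_ms: 10000   # range scans / multi-partition reads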
When you see responses that long for a single query, it could mean the coordinator itself is what is impeding responsiveness; otherwise, if the coordinator were responsive and only the replicas were slow, you'd see a ReadTimeoutException returned to the client.
To better react to these cases, a number of client drivers implement a strategy called 'speculative execution'. As described in the documentation for the DataStax Java Driver for Apache Cassandra:
Sometimes a Cassandra node might be experiencing difficulties (ex: long GC pause) and take longer than usual to reply. Queries sent to that node will experience bad latency.
One thing we can do to improve that is pre-emptively start a second execution of the query against another node, before the first node has replied or errored out. If that second node replies faster, we can send the response back to the client (we also cancel the first execution – note that “cancelling” in this context simply means discarding the response when it arrives later, Cassandra does not support cancellation of in flight requests at this stage)
You could configure your driver to speculatively execute with a constant threshold for idempotent requests (such as reads). In the 3.x Java driver, it's done this way:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .withSpeculativeExecutionPolicy(
        new ConstantSpeculativeExecutionPolicy(
            500, // delay in milliseconds before a new execution is launched
            2    // maximum number of executions
        ))
    .build();
In this case, if the coordinator is slow to respond, after 500 ms the driver chooses another coordinator and submits a second request, and the first coordinator to respond wins.
Note that this might cause an amplification of requests sent to your cluster overall, so you want to tune that delay in such a way that it only kicks in when response time is highly anomalous. In your case, if requests normally take less than 10ms, 500ms is probably a reasonable number depending on what your higher percentile latencies look like.
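One more note: in the 3.x driver, speculative executions are only used for statements marked as idempotent, so you'd want to flag your reads accordingly. A minimal sketch (the query text is a placeholder):

import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.SimpleStatement;

// Mark an individual statement as idempotent so it is eligible for speculative execution
SimpleStatement stmt = new SimpleStatement("SELECT * FROM my_keyspace.disk_usage WHERE host = 'host-1'");
stmt.setIdempotent(true);

// Or make idempotence the default for all statements, passed via Cluster.builder().withQueryOptions(...)
QueryOptions options = new QueryOptions().setDefaultIdempotence(true);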
All that being said, if you are able to identify that the problem is the new node behaving poorly as a coordinator, it's worth understanding why. Adding speculative execution could be a nice way of working around the problem, but it's probably better to understand why the new node is performing so slowly. Having monitoring in place to observe Cassandra's metrics would likely give good visibility into the problem.
That is behaviour you can see with too high a consistency level, or with too few copies of the data (a low replication factor). When a new node is added to the cluster, a rearrangement of token ownership occurs; once it is determined which data the new node will own, it starts streaming that data, which may saturate the network.
In your question you don't mention the network setup or whether you are using cloud instances, both of which have a direct impact on these constraints; for example, an AWS m3.large instance is much more restricted in network capability than an i3.4xlarge.
Another variable to consider is the disk configuration. If you are using your own hardware, look for the IO cap of your drive setup; if you are in the cloud, using instance storage, when available, will perform better than external volumes (such as AWS EBS; if that is the case, ensure that you enable the "EBS optimized" option if your instance type allows it).
Usually an RF of 3 with a consistency level of QUORUM should also help you prevent the issue.
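If you suspect the bootstrap streaming is saturating the network, you can watch it and throttle it at runtime; a sketch of the commands (run against the streaming nodes; in Cassandra 3.x the value is in megabits per second, the same unit as the yaml setting):

nodetool netstats                 # shows active streams and their progress
nodetool setstreamthroughput 100  # temporarily cap streaming throughput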
I have a 5-node Cassandra cluster and each node currently owns 1 TB of data. When I tried adding another node, it took 15+ hours to reach the 'UN' state.
Is there a way to make it faster?
Cassandra version: 3.0.13
Environment : AWS , m4.2xlarge machines.
1 TB is a lot of data per node. Since you have a 5-node cluster and you're adding a sixth node, the new node will take on roughly 0.83 TB of data that has to be streamed from the other nodes. That is the equivalent of about 6.67 Tbit, or 6990507 Mbit. Cassandra's default value for stream_throughput_outbound_megabits_per_sec is 200. 6990507 / 200 = 34952.5 seconds, or about 9.7 hours, to transfer all the data. Since you're probably running other traffic at the same time, this could very well take 15 hours.
Solution: Change the stream_throughput_outbound_megabits_per_sec on all nodes to a higher value.
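For example, you could either raise it in cassandra.yaml (takes effect after a restart) or change it at runtime with nodetool; a sketch, assuming your network can sustain 400 Mbit/s of streaming:

stream_throughput_outbound_megabits_per_sec: 400   # cassandra.yaml, per node

nodetool setstreamthroughput 400                   # or at runtime, per node, no restart needed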
Note: Don't forget to run nodetool cleanup on the existing nodes after the new node has joined the cluster, so they drop the data they no longer own.
I have a Cassandra cluster with a single data center and 2 nodes. The two nodes are in two racks. The first node holds 6.6 GiB and the second holds 5.9 GiB. The second node often goes down, and when I restart the server it takes a lot of time to come back up. Can anyone help me out of this situation?
I'm currently exploring CouchDB replication and trying to figure out the difference between the max_replication_retry_count and retries_per_request configuration options in the [replicator] section of the configuration file.
Basically, I want to configure continuous replication from a local CouchDB to a remote instance that never stops retrying, allowing for potentially long periods of being offline (days or even weeks). So I'd like infinite replication attempts with a maximum retry interval of around 5 minutes. Can I do this? Do I need to change the default configuration to achieve this?
Here are the replies I got on the CouchDB mailing lists:
If we are talking about Couch 1.6, the attribute retries_per_request controls the number of attempts the current replication will make to read the _changes feed before giving up. The attribute max_replication_retry_count controls the number of times the whole replication job will be retried by the replication manager. Setting this attribute to “infinity” should make the replication manager never give up.
I don't think the interval between those attempts is configurable. As far as I understand, it starts at 2.5 seconds between retries and then doubles until it reaches 10 minutes, which is a hard upper limit.
Extended answer:
The answer is slightly different depending if you're using 1.x/2.0
releases or current master.
If you're using 1.x or 2.0 release: Set "max_replication_retry_count =
infinity" so it will always retry failed replications. That setting
controls how the whole replication job restarts if there is any error.
Then "retries_per_request" can be used to handle errors for individual
replicator HTTP requests. Basically the case where a quick immediate
retry succeeds. The default value for "retries_per_request" is 10.
After the first failure, there is a 0.25 second wait. Then on next
failure it doubles to 0.5 and so on. Max wait interval is 5 minutes.
But if you expect to be offline routinely, maybe it's not worth retrying individual requests for too long, so reduce "retries_per_request" to 6 or 7. Then individual requests would retry a few times for about 10 - 20 seconds, after which the whole replication job will crash and retry.
If you're using current master, which has the new scheduling
replicator: No need to set "max_replication_retry_count", that setting
is gone and all replication jobs will always retry for as long as
replication document exists. But "retries_per_request" works the same
as above. Replication scheduler also does exponential backoffs when
replication jobs fail consecutively. First backoff is 30 seconds. Then
it doubles to 1 minute, 2 minutes, and so on. Max backoff wait is
about 8 hours. But if you don't want to wait 4 hours on average for
the replication to restart when network connectivity is restored, and
want it to be about 5 minutes or so, set "max_history = 8" in the
"replicator" config section. max_history controls how much history of
past events are retained for each replication job. If there is less
history of consecutive crashes, that backoff wait interval will also
be shorter.
So to summarize, for 1.x/2.0 releases:
[replicator]
max_replication_retry_count = infinity
retries_per_request = 6
For current master:
[replicator]
max_history = 8
retries_per_request = 6
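For context, these settings apply to replication jobs defined as documents in the _replicator database. A minimal sketch of a continuous push replication document, created with curl and using placeholder hostnames and database names (admin credentials may be required):

curl -X PUT http://localhost:5984/_replicator/my_continuous_rep \
     -H "Content-Type: application/json" \
     -d '{"source": "mydb", "target": "http://remote-host:5984/mydb", "continuous": true}'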
Unable to add a new node to a Cassandra 1.2 ring because streaming times out. We have streaming_socket_timeout_in_ms=30000. What should I change it to? Why is streaming not retried ad infinitum until successful?
In the end it worked, after a dozen attempts. The difference was:
1. Raise streaming_socket_timeout_in_ms to 600000 (10 min), to be longer than the longest imaginable GC pause. (Maybe 0 ms would have also worked.)
2. Before starting the bootstrap of the new node, do a rolling restart of the nodes in the same datacenter as the new node, so that their heap starts fresh, from a low value.
3. Start the bootstrap at the end of the work day and leave it running overnight.