Cassandra 3.10 debug.log contains frequent "FailureDetector.java:457 - Ignoring interval time of..." - cassandra

The debug.log files for one of our Cassandra 3.10 clusters has frequent messages similar to “FailureDetector.java:457 - Ignoring interval time of…”
The messages appear even if the cluster is idle. I see the messages at a rate of about 1 per second on each node of this 6 node cluster (3 nodes each in two data centers).
Can someone tell me what causes the messages and if they are something to be concerned about?
We have a couple of other small clusters supporting the same application (different environments) and I see this message much less often (days apart).

The FailureDetector is responsible of deciding if a node is considered UP or DOWN.
The gossip process tracks state from other nodes both directly (nodes
gossiping directly to it) and indirectly (nodes communicated about
secondhand, third-hand, and so on). Rather than have a fixed threshold
for marking failing nodes, Cassandra uses an accrual detection
mechanism to calculate a per-node threshold that takes into account
network performance, workload, and historical conditions. During
gossip exchanges, every node maintains a sliding window of
inter-arrival times of gossip messages from other nodes in the
cluster.
Here you can find the source code, which gives you the log message. It is set to DEBUG level because they may be helpful in tracking down the actual issue causing the latency, but don't indicate a problem on their own.
In other words: your node measures the acknowledgement latency for each gossip message sent to the other nodes e.g: X nanosec for IP address1, Z nanosec for IP address2, etc. If eitherX or Y is above the expected 2 sec threshold as stated in MAX_INTERVAL_IN_NANO, it will get reported.
Problems, which can cause this log message:
Huge load on the node(s): e.g too many large partitions
High pressure: e.g. too many queries in sort period of time
Bad network connection
The extra FailureDetector logging was added with this:
Expose phi values from failure detector via JMX and tweak debug
and trace logging (CASSANDRA-9526)
and also I found this open issue, might be related to your problem:
The failure detector becomes more sensitive when the network is flakey(CASSANDRA-9536)
Also I find this article about Gossiping and Failure Detection very useful.

Related

Hazelcast causing heavy traffic between nodes

NOTE: Found the root cause in application code using hazelcast which started to execute after 15 min, the code retrieved almost entire data, so the issue NOT in hazelcast, leaving the question here if anyone will see same side effect or wrong code.
What can cause heavy traffic between Hazelcast (v3.12.12, also tried 4.1.1) 2 nodes ?
It holds maps with lot of data, no new entries are added/removed within that time, only map values are updated.
Java 11, Memory usage 1.5GB out of 12GB, no full GCs identified.
Following JFR the high IO is from:
com.hazelcast.internal.networking.nio.NioThread.processTaskQueue()
Below chart of Network IO, 15 min after start traffic jumps from 15 to 60 MB. From application perspective nothing changed after these 15 min.
This smells garbage collection, you are most likely to be running into long gc pauses. Check your gc logs, which you can enable using verbose gc settings on all members. If there are back-to-back GCs then you should do various things:
increase the heap size
tune your gc, I'd look into G1 (with -XX:MaxGCPauseMillis set to a reasonable number) and/or ZGC.

How to avoid query taking long time when adding new node to cluster

I found query (just a simple select query) takes long time when adding a new node to cluster.
My execution time log :
17:49:40.008 [ThreadPoolTaskScheduler-14] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 8 ms
17:50:00.010 [ThreadPoolTaskScheduler-3] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 15010 ms
17:50:15.008 [ThreadPoolTaskScheduler-4] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 10008 ms
17:50:20.008 [ThreadPoolTaskScheduler-16] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 7 ms
Normally it takes about 10ms, and suddently takes 15000ms when adding node.
And I found it stuck because waiting for new node init data
Cassandra log (the new node):
INFO [HANDSHAKE-/194.187.1.52] 2019-05-31 17:49:36,056 OutboundTcpConnection.java:560 - Handshaking version with /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,059 Gossiper.java:1055 - Node /194.187.1.52 is now part of the cluster
INFO [RequestResponseStage-1] 2019-05-31 17:49:36,069 Gossiper.java:1019 - InetAddress /194.187.1.52 is now UP
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [MigrationStage:1] 2019-05-31 17:49:39,347 ViewManager.java:137 - Not submitting build tasks for views in keyspace system_traces as storage service is not initialized
INFO [MigrationStage:1] 2019-05-31 17:49:39,352 ColumnFamilyStore.java:411 - Initializing system_traces.events
INFO [MigrationStage:1] 2019-05-31 17:49:39,382 ColumnFamilyStore.java:411 - Initializing system_traces.sessions
Stuck when : Node /194.187.1.52 is now part of the cluster
And client will wait for new node init all data
What I have tried:
1. I try use consistency with ONE or QUORUM, and is no difference
2. I try turn replication factor to 1, 2 or 3, and still no difference
Why new node become part of cluster when that node not init data completely.
Is there a way to solve this.
I expect when I query to old node, the performance is not influenced by just waiting for new node to init data.
.
.
.
I resolved this problem.
I write wrong config, I let all node become seeds even before they joining cluster, this cause read timed-out during adding new node to cluster.
After fix this. all read is normal, but somehow I found insert query timed-out during adding node.
Finally I tune this to avoid insert timed-out:
/sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
and also change conf to limit throughput
stream_throughput_outbound_megabits_per_sec : 100
Really thanks for help.
This is just a theory, but one possible cause of this is that the new node is being chosen as a coordinator by your driver client, in this case, consistency level and replication aren't the main contributing factor to the delay in servicing your query.
If the new node is slowly performing initially for whatever reason and the driver is sending requests to it, the behavior of the coordinator can impact the servicing of your request.
What exactly is runJob doing? You suggested it is making a single query, but is it possible that it's a range query?
If it's a single query and it's taking as long as 10 seconds, that seems odd as the default read_request_timeout is 5 seconds. If it's a range query (a read involving multiple partitions), the default is 10 seconds. Are you adjusting those timeouts?
When you see responses that long for a single query that could mean the coordinator is what is impeding responsiveness as otherwise if the coordinator was responsive and the replicas were slow, you'd see ReadTimeoutException message serviced to the client.
To better react to these cases, a number of client drivers implement a strategy called 'speculative execution'. As described in the documentation for the DataStax Java Driver for Apache Cassandra:
Sometimes a Cassandra node might be experiencing difficulties (ex: long GC pause) and take longer than usual to reply. Queries sent to that node will experience bad latency.
One thing we can do to improve that is pre-emptively start a second execution of the query against another node, before the first node has replied or errored out. If that second node replies faster, we can send the response back to the client (we also cancel the first execution – note that “cancelling” in this context simply means discarding the response when it arrives later, Cassandra does not support cancellation of in flight requests at this stage)
You could configure your driver to speculatively execute with a constant threshold for idempotent requests (such as reads are). In the 3.x java driver, its done this way:
Cluster cluster = Cluster.builder()
.addContactPoint("127.0.0.1")
.withSpeculativeExecutionPolicy(
new ConstantSpeculativeExecutionPolicy(
500, // delay before a new execution is launched
2 // maximum number of executions
))
.build();
In this case, if the coordinator was slow to respond, after 500ms the driver chooses another coordinator and submits a second quest, and the first coordinator to respond wins.
Note that this might cause an amplification of requests sent to your cluster overall, so you want to tune that delay in such a way that it only kicks in when response time is highly anomalous. In your case, if requests normally take less than 10ms, 500ms is probably a reasonable number depending on what your higher percentile latencies look like.
All that being said, if you are able to identify that the problem is the new node behaving poorly as a coordinator. It's worth understanding why. Adding speculative execution could be a nice way of possibly working around the problem, but it's probably better to try to understand why the new node is so slowly performing. Having monitoring in place to observe Cassandra's metrics would likely give great visibility into the problem.
That is a behaviour that you can find with too high consistency, or not having enough copies of data (replication factor). When a new node is added to the cluster, a rearrangement of the ownership of tokens occur, once that it is determined what data will the new node be the owner, it will start streaming that data, which may saturate the network.
In your question you don't mention the network setting or if you are using cloud instances, which have a direct impact for these constraints, for example, an AWS m3.large instance will be more restricted in network capabilities than an i3.4xlarge.
Other variable to consider is disk configuration, if you are using your own hardware look for the cap in IO of your drives settings; if you are on the cloud, using the instance storage, when available, will have a better performance than external volumes ( as AWS EBS; if this is the case, ensure that you are enabling the " EBS optimized" option if your instance allows it)
Usually a RF of 3 with consistency level of Quorum should also help you to prevent the issue.

Elasticsearch cluster size/architecture

I've been trying to setup an elasticsearch cluster for processing some log data from some 3D printers .
we are having more than 850K documents generated each day for 20 machines . each of them has it own index .
Right now we have the data of 16 months with make it about 410M records to index in each of the elasticsearch index .
we are processing the data from CSV files with spark and writing to an elasticsearch cluster with 3 machines each one of them has 16GB of RAM and 16 CPU cores .
but each time we reach about 10-14M doc/index we are getting a network error .
Job aborted due to stage failure: Task 173 in stage 9.0 failed 4 times, most recent failure: Lost task 173.3 in stage 9.0 (TID 17160, wn21-xxxxxxx.ax.internal.cloudapp.net, executor 3): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[X.X.X.X:9200]]
I'm sure this is not a network error it's just elasticsearch cannot handle more indexing requests .
To solve this , I've tried to tweak many elasticsearch parameters such as : refresh_interval to speed up the indexation and get rid of the error but nothing worked . after monitoring the cluster we think that we should scale it up.
we also tried to tune the elasticsearch spark connector but with no result .
So I'm looking for a right way to choose the cluster size ? is there any guidelines on how to choose your cluster size ? any highlights will be helpful .
NB : we are interested mainly in indexing data since we have only one query or two to run on data to get some metrics .
I would start by trying to split up the indices by month (or even day) and then search across an index pattern. Example: sensor_data_2018.01, sensor_data_2018.02, sensor_data_2018.03 etc. And search with an index pattern of sensor_data_*
Some things which will impact what cluster size you need will be:
How many documents
Average size of each document
How many messages/second are being indexed
Disk IO speed
I think your cluster should be good enough to handle that amount of data. We have a cluster with 3 nodes (8CPU / 61GB RAM each), ~670 indices, ~3 billion documents, ~3TB data and have only had indexing problems when the indexing rate exceeds 30,000 documents/second. Even then only the indexing of a few documents will fail and can be successfully retried after a short delay. Our implementation is also very indexing heavy with minimal actual searching.
I would check the elastic search server logs and see if you can find a more detailed error message. Possible look for RejectedExecutionException's. Also check the cluster health and node stats when you start to receive the failures which might shed some more light on whats occurring. If possible implement a re-try and backoff when failures start to occur to give ES time to catch up to the load.
Hope that helps a bit!
This is a network error, saying the data node is ... lost. Maybe a crash, you can check the elasticsearch logs to see whats going on.
The most important thing to understand with elasticsearch4Hadoop is how work is parallelized:
1 Spark partition by 1 elasticsearch shard
The important thing is sharding, this is how you load-balance the work with elasticsearch. Also, refresh_interval must be > 30 secondes, and, you should disable replication when indexing, this is very basic configuration tuning, I am sure you can find many advises about that on documentation.
With Spark, you can check on web UI (port 4040) how the work is split into tasks and partitions, this help a lot. Also, you can monitor the network bandwidth between Spark and ES, and es node stats.

High Native-Transport-Requests All time Blocked

After running tpstats on all nodes. I see a lot of nodes having high number of ALL TIME BLOCKED NTR. We have a 4 node cluster and the values for NTR ALL TIME BLOCKED are :
NODE 1: 23953
NODE 2: 2935
NODE 3: 15229
NODE 4: 5951
I know ALL TIME BLOCKED is bad and hence worried as to what I am doing wrong.
This pool handles cql requests, so it is the number of active CQL requests allowed. Its limited to prevent too many active ones from OOMing your system (ie each returning large blobs). This effectively applies backpressure to your client application to slow down. Unfortunately if you have small requests this isnt ideal and hurts your throughput so in CASSANDRA-11363 they added a setting to make the space tradeoff for small bursty workloads.
If you upgrade to 2.2.8+ you can set the max queue size of that threadpool with -Dcassandra.max_queued_native_transport_requests=4096

How to tune "spark.rpc.askTimeout"?

We have a spark 1.6.1 application, which takes input from two kafka topics and writes the result to another kafka topic. The application receives some large (approximately 1MB) files in the first input topic and some simple conditions from the second input topic. If the condition is satisfied, the file is written to output topic else held in state (we use mapWithState).
The logic works fine for less (few hundred) number of input files, but fails with org.apache.spark.rpc.RpcTimeoutException and recommendation is to increase spark.rpc.askTimeout. After increasing from default (120s) to 300s the ran fine longer but crashed with the same error after 1 hour. After changing the value to 500s, the job ran fine for more than 2 hours.
Note: We are running the spark job in local mode and kafka is also running locally in the machine. Also, some time I see warning "[2016-09-06 17:36:05,491] [WARN] - [org.apache.spark.storage.MemoryStore] - Not enough space to cache rdd_2123_0 in memory! (computed 2.6 GB so far)"
Now, 300s seemed large enough a timeout considering all local configuration. But any idea, how to come up to an ideal timeout value instead of just using 500s or higher based on testing, as I see crashed cases using 800s and cases suggesting to use 60000s?
I was facing the same problem, I found this page saying that under heavy workloads it is wise to set spark.network.timeout(which controls all the timeouts, also the RPC one) to 800. At the moment it solved my problem.

Resources