Hazelcast causing heavy traffic between nodes

NOTE: The root cause turned out to be application code using Hazelcast that started executing after 15 minutes and retrieved almost the entire data set, so the issue is NOT in Hazelcast itself. I'm leaving the question here in case anyone else sees the same side effect or has the same kind of code.
What can cause heavy traffic between 2 Hazelcast nodes (v3.12.12, also tried 4.1.1)?
The cluster holds maps with a lot of data; no new entries are added or removed during that time, only map values are updated.
Java 11, memory usage 1.5 GB out of 12 GB, no full GCs identified.
According to JFR, the high IO comes from:
com.hazelcast.internal.networking.nio.NioThread.processTaskQueue()
A chart of network IO (not reproduced here) shows traffic jumping from 15 to 60 MB about 15 minutes after start. From the application's perspective, nothing changed after these 15 minutes.
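For illustration, here is a minimal sketch (map name and value type are hypothetical) of the kind of application code the NOTE above refers to: a call that pulls almost the entire map contents over the network to the calling member, which is enough to explain a sustained jump in inter-node traffic.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap; // in Hazelcast 4.x the interface lives in com.hazelcast.map

public class BulkReadSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> map = hz.getMap("bigMap"); // hypothetical map name

        // values() fetches every entry from every owning member to this caller,
        // so on a large map this alone produces heavy member-to-member network IO.
        long chars = 0;
        for (String value : map.values()) {
            chars += value.length();
        }
        System.out.println("fetched ~" + chars + " chars");

        // A query predicate, an EntryProcessor, or an aggregation would keep most
        // of the work on the owning members instead of shipping the whole map.
    }
}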

This smells like garbage collection; you are most likely running into long GC pauses. Check your GC logs, which you can enable with verbose GC settings on all members. If there are back-to-back GCs, then you should do one or more of the following:
increase the heap size
tune your GC; I'd look into G1 (with -XX:MaxGCPauseMillis set to a reasonable number) and/or ZGC.
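If turning on verbose GC logging is not immediately possible, a rough programmatic check of GC activity can be done from inside the application with the standard java.lang.management API; a minimal sketch:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {
    public static void main(String[] args) {
        // Prints cumulative collection counts and times for each collector,
        // e.g. "G1 Young Generation" / "G1 Old Generation" when running on G1.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}

Sampling these counters twice, a few minutes apart, shows whether GC time is actually growing during the traffic spike.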

Related

GC pause (G1 Evacuation Pause) issue in Cassandra

I'm getting the following error in the application logs when trying to connect to a Cassandra cluster with 6 nodes:
com.datastax.driver.core.exceptions.OperationTimedOutException: [DB1:9042] Timed out waiting for server response
I have set the Java heap to 8 GB (-Xms8G -Xmx8G); I'm wondering if 8 GB is too much.
Below is the timeout configuration in cassandra.yaml:
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 20000
write_request_timeout_in_ms: 10000
request_timeout_in_ms: 20000
The application doesn't run heavy delete or update statements, so my question is: what else may cause the long GC pauses? The majority of the GC pauses I can see in the log are of type G1 Evacuation Pause; what does that mean exactly?
The heap size heavily depends on the amount of data that you need to process. For production workloads, a minimum of 16 GB is usually recommended. Also, G1 isn't very effective on small heaps (less than 12 GB); it's better to use the default ParNew GC for such heap sizes, but you may need to tune the GC to get better performance. Look into this blog post that explains GC tuning.
Regarding your question about the "G1 Evacuation Pause", look into these blog posts: 1 and 2. Here is a quote from the 2nd post:
Evacuation Pause (G1 collection), in which live objects are copied from one set of regions (young OR young+old) to another set. It is a stop-the-world activity and all the application threads are stopped at a safepoint during this time.
For you this means that you're filling regions very fast and the regions are big, so it takes a significant amount of time to copy the data.
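Separately, the OperationTimedOutException in the question is raised by the client driver, and the driver's read timeout is configured independently of the server-side values in cassandra.yaml. A minimal sketch of raising it, assuming the DataStax Java driver 3.x (the contact point is taken from the error message):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SocketOptions;

public class DriverTimeoutSketch {
    public static void main(String[] args) {
        // Raise the per-request read timeout on the client side.
        // This only tolerates long pauses; it does not fix the GC pauses themselves.
        SocketOptions socketOptions = new SocketOptions()
                .setReadTimeoutMillis(20000);

        Cluster cluster = Cluster.builder()
                .addContactPoint("DB1")          // host from the error message
                .withSocketOptions(socketOptions)
                .build();
        Session session = cluster.connect();
        // ... run queries ...
        cluster.close();
    }
}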

Cassandra 3.10 debug.log contains frequent "FailureDetector.java:457 - Ignoring interval time of..."

The debug.log files for one of our Cassandra 3.10 clusters have frequent messages similar to “FailureDetector.java:457 - Ignoring interval time of…”
The messages appear even if the cluster is idle. I see the messages at a rate of about 1 per second on each node of this 6 node cluster (3 nodes each in two data centers).
Can someone tell me what causes the messages and if they are something to be concerned about?
We have a couple of other small clusters supporting the same application (different environments) and I see this message much less often (days apart).
The FailureDetector is responsible for deciding whether a node is considered UP or DOWN.
The gossip process tracks state from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes communicated about secondhand, third-hand, and so on). Rather than have a fixed threshold for marking failing nodes, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network performance, workload, and historical conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster.
Here you can find the source code that produces the log message. It is logged at DEBUG level because such messages may be helpful in tracking down the actual issue causing the latency, but they don't indicate a problem on their own.
In other words: your node measures the acknowledgement latency for each gossip message sent to the other nodes, e.g. X nanoseconds for IP address1, Z nanoseconds for IP address2, etc. If either X or Z is above the expected 2-second threshold defined by MAX_INTERVAL_IN_NANO, it will get reported.
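As a simplified sketch of the sliding-window idea described above (illustrative only, not Cassandra's actual FailureDetector code; the 2-second cutoff mirrors MAX_INTERVAL_IN_NANO):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.TimeUnit;

// Simplified illustration: keep a sliding window of gossip inter-arrival times
// for one endpoint and log intervals above the cutoff instead of recording them.
public class ArrivalWindowSketch {
    private static final long MAX_INTERVAL_NANOS = TimeUnit.SECONDS.toNanos(2);
    private static final int WINDOW_SIZE = 1000;

    private final Deque<Long> intervals = new ArrayDeque<>();
    private long lastArrivalNanos = -1;

    public void report(long nowNanos, String endpoint) {
        if (lastArrivalNanos > 0) {
            long interval = nowNanos - lastArrivalNanos;
            if (interval <= MAX_INTERVAL_NANOS) {
                if (intervals.size() >= WINDOW_SIZE) {
                    intervals.removeFirst();
                }
                intervals.addLast(interval);
            } else {
                // Corresponds to the DEBUG "Ignoring interval time of ..." message.
                System.out.println("Ignoring interval time of " + interval + " for " + endpoint);
            }
        }
        lastArrivalNanos = nowNanos;
    }
}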
Problems which can cause this log message:
Huge load on the node(s): e.g. too many large partitions
High pressure: e.g. too many queries in a short period of time
Bad network connection
The extra FailureDetector logging was added with this:
Expose phi values from failure detector via JMX and tweak debug and trace logging (CASSANDRA-9526)
I also found this open issue, which might be related to your problem:
The failure detector becomes more sensitive when the network is flakey (CASSANDRA-9536)
I also found this article about Gossiping and Failure Detection very useful.

Couchdb views crashing for large documents

CouchDB keeps crashing whenever I try to build the index of the views of a design document that emits values for large documents. The total size of the database is 40 MB, and I guess the documents are about 5 MB each. We're talking about large JSON documents without any attachments.
What concerns me is that I have 2.5 GB of free RAM before trying to access the views, but as soon as I do, CPU usage rises to 99% and all the free RAM gets eaten by erl.exe before the indexing fails with exit code 1.
Here is the log:
[info] 2016-11-22T22:07:52.263000Z couchdb#localhost <0.212.0> -------- couch_proc_manager <0.15603.334> died normal
[error] 2016-11-22T22:07:52.264000Z couchdb#localhost <0.15409.334> b9855eea74 rexi_server throw:{os_process_error,{exit_status,1}} [{couch_mrview_util,get_view,4,[{file,"src/couch_mrview_util.erl"},{line,56}]},{couch_mrview,query_view,6,[{file,"src/couch_mrview.erl"},{line,244}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
Views that skip these documents can be accessed without issue. What general guidelines could you give me to help with this kind of situation? I am using CouchDB 2.0 on Windows.
Many thanks
Update: I tried limiting the number of view server instances to 1 and varying the maximum RAM allowed for couchjs, but it keeps crashing. I also noticed that even though CouchDB is supposed to pass only one document at a time to the view server, erl.exe keeps eating all the available RAM (3 GB used to update three 5 MB docs...). Initially I thought this could be because of the multiple couchjs instances, but apparently that isn't the case.
Update: Made some progress; now it looks like the indexing runs well for just under 10 minutes, then erl.exe crashes. I have posted the dump here (just to clarify, "well" means 99% CPU usage and a completely frozen screen).

What is spark.driver.maxResultSize?

The reference says:
Limit of total size of serialized results of all partitions for each Spark action (e.g. collect). Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors.
What does this attribute do exactly? I mean, at first (since I am not battling a job that fails with out-of-memory errors) I thought I should increase it.
On second thought, it seems that this attribute defines the maximum size of the result a worker can send to the driver, so leaving it at the default (1G) would be the best approach to protect the driver.
But what happens in that case: will the worker have to send more messages, so the only overhead is that the job will be slower?
If I understand correctly, assuming that a worker wants to send 4G of data to the driver, then having spark.driver.maxResultSize=1G will cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize). If so, then increasing that attribute to protect my driver from being killed by YARN should be wrong.
But the question above still remains: if I set it to 1M (the minimum), will that be the most protective approach?
assuming that a worker wants to send 4G of data to the driver, then having spark.driver.maxResultSize=1G, will cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize).
No. If the estimated size of the data is larger than maxResultSize, the given job will be aborted. The goal here is to protect your application from driver loss, nothing more.
if I set it to 1M (the minimum), will it be the most protective approach?
In a sense, yes, but obviously it is not useful in practice. A good value should allow the application to proceed normally while protecting it from unexpected conditions.
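A minimal sketch of how the limit behaves, using the Java API (the property name is real; the app name, master, and data are illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class MaxResultSizeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("maxResultSize-sketch")   // illustrative
                .setMaster("local[*]")                // illustrative
                .set("spark.driver.maxResultSize", "1g");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // If the serialized results of all partitions exceed 1g, collect()
            // does NOT split the transfer into smaller messages; the job is
            // simply aborted with an error referring to spark.driver.maxResultSize.
            List<Integer> all = sc.parallelize(Arrays.asList(1, 2, 3)).collect();
            System.out.println(all);
        }
    }
}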

Severe degradation in Cassandra Write performance with continuous streaming data over time

I notice a severe degradation in Cassandra write performance with continuous writes over time.
I am inserting time series data with the timestamp (T) as the column name, into a wide row that stores 24 hours' worth of data.
Streaming data is written from a data generator (4 instances, each with 256 threads) inserting data into multiple rows in parallel.
Additionally, data is also inserted into a column family that has indexes over DateType and UUIDType columns.
CF1:
Col1 | Col2 | Col3 (DateType) | Col4 (UUIDType) |
RowKey1
RowKey2
:
:
CF2 (Wide column family):
RowKey1 (T1, V1) (T2, V3) (T4, V4) ......
RowKey2 (T1, V1) (T3, V3) .....
:
:
The number of data points inserted per second decreases over time until no further inserts are possible. The initial throughput is on the order of 60,000 ops/sec for ~6-8 hours, and then it gradually tapers down to 0 ops/sec. Restarting the DataStax_Cassandra_Community_Server service on all nodes helps restore the original throughput, but the behaviour is observed again after a few hours.
OS: Windows Server 2008
No.of nodes: 5
Cassandra version: DataStax Community 1.2.3
RAM: 8GB
HeapSize: 3GB
Garbage collector: default settings [ParNewGC]
I also notice a huge increase in the number of pending write requests reported by OpsCenter (on the order of 200,000) when the performance begins to degrade.
I fail to understand what is preventing the write operations from completing and why they pile up over time. I don't see anything suspicious in the Cassandra logs.
Do the OS settings have anything to do with this?
Any suggestions to probe this issue further?
Do you see an increase in pending compactions (nodetool compactionstats)? Or are you seeing blocked flush writers (nodetool tpstats)? I'm guessing you're writing data to Cassandra faster than it can be consumed.
Cassandra won't block on writes, but that doesn't mean that you won't see an increase in the amount of heap used. Pending writes have overhead, as do blocked memtables. In addition, each SSTable has some memory overhead. If compactions fall behind this is magnified. At some point you probably don't have enough headroom in your heap to allocate the objects required for a single write, and you end up spending all your time waiting for an allocation that the GC can't provide.
With increased total capacity, or more IO on the machines consuming the data, you would be able to sustain this write rate, but everything indicates you don't have enough capacity to sustain that load over time.
Bringing your write timeout in line with the new default in 2.0 (of 2s instead of 10s) will help with your write backlog by allowing load shedding to kick in faster: https://issues.apache.org/jira/browse/CASSANDRA-6059
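For reference, that corresponds to setting the following in cassandra.yaml on each node (2000 ms being the 2.0 default mentioned above):
write_request_timeout_in_ms: 2000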
