Spikes in Tomcat response times - garbage collection

I have been seeing intermittent spikes in the response times of a REST API hosted on Tomcat. The typical response time is under 2 ms, but occasionally a request takes more than 1.5 seconds, and these requests cause timeouts at the client end because the client is configured with a very low connection timeout. The spike occurs roughly every 1 to 1.5 hours. There is no spike in CPU or memory. The application fetches data from Redis, and there are no spikes on the Redis machines either. The server handles about 500 requests per second and the thread pool is always under-utilized. Following is the Tomcat connector configuration:
<Connector port="8080"
           connectionTimeout="60000"
           maxThreads="500"
           minSpareThreads="50"
           acceptCount="2000"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           useSendfile="false"
           compression="force"
           enableLookups="false"
           redirectPort="8443" />
The machine has 8 GB of RAM and the JVM is configured with -Xms and -Xmx of 4 GB. I am not using any explicit GC arguments. (Tomcat 9.0.26, Java 11, 4 cores, 8 GB RAM)
I suspect GC might be causing the issue, but as I don't see a spike in either CPU or memory, I have no clue why this is happening. Can anyone help by suggesting some ideas for resolving this issue?
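Since the heap is fixed at 4 GB and no GC flags are set, Java 11 will default to G1. A minimal diagnostic sketch is to enable unified GC and safepoint logging, for example via CATALINA_OPTS (the log path and rotation values below are illustrative):
-Xms4g -Xmx4g
-Xlog:gc*,safepoint:file=/var/log/tomcat/gc.log:time,uptime,level,tags:filecount=5,filesize=20m
If the 1.5-second spikes line up with long pauses in gc.log, that confirms the suspicion; if not, the stalls are coming from somewhere other than the collector.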

Related

Node process often loses CPU time on Linux VM which increases latencies for client requests

Problem - Increase in latencies (p90 > 30 s) for a simple WebSocket server hosted on a VM.
Repro
Run a simple WebSocket server on a single VM. The server simply receives a request and upgrades it to a WebSocket connection without any other logic. The client continuously sends 50 parallel requests for a period of 5 minutes (approximately 3000 requests in total).
Issue
Most requests have a latency in the range of 100 ms-2 s. However, for 300-500 requests we observe that latencies are high (10-40 s, with p90 greater than 30 s), while some hit TCP timeouts (the Linux default of 127 s).
When analyzing the VM processes, we observe that while requests are taking a long time, the node process loses its CPU share in favor of some processes started by the VM.
Further Debugging
Increasing process priority (renice) and I/O priority (ionice) did not solve the problem.
Increasing resources to 8 cores and 32 GiB of memory did not solve the problem.
Edit - 1
Repro Code ( clustering enabled ) - https://gist.github.com/Sid200026/3b506a9f77cfce3fa4efdd1ec9dd29bc
When monitoring active processes via htop, we find that the processes started by the following commands are causing an issue
python3 -u bin/WALinuxAgent-2.9.0.4-py2.7.egg -run-exthandlers
/usr/lib/linux-tools/5.15.0-1031-azure/hv_kvp_daemon -n
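A hedged mitigation sketch, assuming the Azure agent runs as a systemd unit named walinuxagent (the unit name may differ by distribution): cap its CPU share with a systemd CPUQuota so its periodic runs cannot starve the node process.
# Limit the agent to a quarter of one core (illustrative value):
sudo systemctl set-property walinuxagent.service CPUQuota=25%
# Verify the cap is applied (should report 250ms per second of CPU):
systemctl show walinuxagent.service -p CPUQuotaPerSecUSec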

Socket.io hangs after 2k concurrent connections

I purchased a VPS with 2 vCPU cores and 4 GB of RAM and deployed a Node.js Socket.IO server. It works fine without any issues up to 2k concurrent connections, but that limit seems very low to me. When the connection count reaches 3k, the Socket.IO server hangs and stops working.
Normally memory usage is 300 MB, but after 3k connections memory usage climbs to about 2.5 GB; the server stops emitting packets for several seconds, then works for a few seconds and hangs again.
My server is not too small for this number of connections.
Are there any suggestions for optimizations to increase the number of concurrent connections without the server hanging after a few thousand clients connect simultaneously? For a few clients it works fine.
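A couple of generic, hedged first checks (they may or may not be the actual bottleneck here): make sure the per-process open-file-descriptor limit is well above the target connection count, since every socket consumes a descriptor, and let the V8 heap grow beyond its default cap on a 4 GB machine. server.js below is a placeholder for the actual entry point.
# Check and raise the open-file limit for the shell that starts the server:
ulimit -n
ulimit -n 65535
# Allow the Node old-generation heap to grow to roughly 3 GB:
node --max-old-space-size=3072 server.js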

Fail fast Cassandra NTR blocked tasks

We ran into an issue where a Cassandra node goes down in a cluster of 18 nodes and the overall cluster read/write latencies spike, due to which the Native Transport request threads reach their maximum capacity of 128 (the default), the NTR maximum queued capacity of 128 (also the default) is reached, and native-transport requests start getting blocked.
I am not sure what "blocked" requests means here: does Cassandra start failing incoming requests once the queue is full, or are the requests held on the server side until they time out?
If it's the latter, is it possible to fail these requests fast from the Cassandra server side?
We are using Apache Cassandra version 2.2.8 with the DataStax Cassandra Java driver 3.0.0.
You can increase the number of requests queued for coordination, which is a common enough configuration for workloads with many tiny requests, with -Dcassandra.max_queued_native_transport_requests=4096 on 2.2.8+. There is no feature to return an error instead of blocking, but the back pressure will be noticed on the client, where requests are queued until you get busy-pool exceptions.
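The flag is typically set through cassandra-env.sh. On the client side, a hedged sketch for driver 3.0.x (the contact point and pool sizes are illustrative, not taken from the question): bound how long the driver waits for a free connection so the back pressure surfaces as a fast failure instead of a long stall.
# cassandra-env.sh on the nodes (2.2.8+):
JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=4096"

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

// With poolTimeoutMillis = 0 the driver throws NoHostAvailableException immediately
// when every connection to a host is saturated, instead of queueing the request.
PoolingOptions poolingOptions = new PoolingOptions()
        .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024)
        .setPoolTimeoutMillis(0);

Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")          // illustrative contact point
        .withPoolingOptions(poolingOptions)
        .build();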

com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response

We are using Apache Cassandra v3.0.9 with com.datastax.cassandra:cassandra-driver-core:3.1.3. Our application works fine most of the time, but about once a week we start getting the following exception from our applications:
com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.ChainedResultSetFuture.getUninterruptibly(ChainedResultSetFuture.java:62)
at com.datastax.driver.core.NewRelicChainedResultSetFuture.getUninterruptibly(NewRelicChainedResultSetFuture.java:11)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
at com.til.cms.graphdao.cassandra.dao.generic.CassandraICMSGenericDaoImpl.getCmsEntityMapForLimitedSize(CassandraICMSGenericDaoImpl.java:2824)
.....
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:770)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
These applications are hitting the Cassandra datacenter for read requests. The datacenter consists of 5 physical servers, each with 2 disks, 64 GB RAM, 40 cores, and a 16 GB heap with G1 GC.
There was no problem with the Cassandra servers as far as our investigation could tell: no increase in load average or iowait, no GC pauses, and nodetool/cqlsh connectivity was fine. We just kept getting these exceptions in our application logs until we restarted the Cassandra servers. The exception was reported randomly for different Cassandra servers in the datacenter, and we had to restart each of them. In normal operation each of these Cassandra servers serves 10K read requests/second and hardly 10 write requests/second. When we encounter this problem, read throughput drops dramatically to 2-3K requests/second.
The replication factor of our Cassandra datacenter is 3, and the following is how we create connections:
Cluster.builder()
    .addContactPoints(nodes)
    .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
    .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder().withLocalDc(localDatacenter).build())
    .withSpeculativeExecutionPolicy(PerHostPercentileTracker.builder(13000).build())
    .build()
EDIT:
We have observed that before we start getting these exceptions, we get the following WARN-level messages in our Java application:
2018-04-03 23:40:06,456 WARN [cluster1-timeouter-0] com.datastax.driver.core.RequestHandler [RequestHandler.java:805] Not retrying statement because it is not idempotent (this message will be logged only once). Note that this version of the driver changes the default retry behavior for non-idempotent statements: they won't be automatically retried anymore. The driver marks statements non-idempotent by default, so you should explicitly call setIdempotent(true) if your statements are safe to retry. See https://docs.datastax.com/en/developer/java-driver/3.1/manual/retries/ for more details.
2018-04-04 00:04:24,856 WARN [cluster1-nio-worker-2] com.datastax.driver.core.PercentileTracker [PercentileTracker.java:108] Got request with latency of 16632 ms, which exceeds the configured maximum trackable value 13000
2018-04-04 00:04:24,858 WARN [cluster1-timeouter-0] com.datastax.driver.core.PercentileTracker [PercentileTracker.java:108] Got request with latency of 16712 ms, which exceeds the configured maximum trackable value 13000
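Given the "not retrying statement because it is not idempotent" warning, a hedged sketch follows (the query text, percentile and execution count are assumptions, not taken from the question): mark safe-to-retry reads as idempotent so the driver's retry and speculative-execution machinery is allowed to re-send them, and note that in driver 3.1 withSpeculativeExecutionPolicy expects the percentile tracker to be wrapped in a PercentileSpeculativeExecutionPolicy.

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PerHostPercentileTracker;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.PercentileSpeculativeExecutionPolicy;

// Per-statement: only statements that are safe to re-execute should be flagged.
Statement read = new SimpleStatement("SELECT * FROM entity WHERE id = ?", entityId)  // illustrative query
        .setIdempotent(true);

// Or declare idempotence as the default (only safe if the application issues no
// non-idempotent writes such as counter updates or list appends):
QueryOptions queryOptions = new QueryOptions()
        .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE)
        .setDefaultIdempotence(true);

// Speculative executions only fire for idempotent statements, and the tracker must
// be wrapped in a policy before being passed to withSpeculativeExecutionPolicy:
PerHostPercentileTracker tracker = PerHostPercentileTracker.builder(13000).build();
PercentileSpeculativeExecutionPolicy specExec =
        new PercentileSpeculativeExecutionPolicy(tracker, 99.0, 2);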

Linux Server | Tomcat 7.0 Exceptions

I am facing these exceptions regularly, and every time I have to restart the server. The exceptions are:
exception:1-> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 61,316,033 milliseconds ago. The last packet sent successfully to the server was 61,316,034 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
exception:2-> Exception in thread "ajp-bio-8009-exec-106" java.lang.OutOfMemoryError: PermGen space
exception:3-> Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.
The server specifications are: 1 GB RAM | 1000 GB bandwidth.
Please also give me tips on how to manage the RAM among the various applications on the server and how to automatically kill processes that are no longer doing useful work. From my research, I believe the processes are not being killed, and the memory once allocated to them is not released until the server reboots. So please help me with this.
I will be really grateful!
Thanks & Regards
Romel Jain
For the PermGen space error, maybe you could add some JVM options (or CATALINA_OPTS) like this:
-XX:MaxPermSize=256m -XX:+CMSClassUnloadingEnabled
-XX:MaxPermSize : permanent generation size
-XX:+CMSClassUnloadingEnabled : allows the JVM to unload unused class definitions
I discussed this error in an old French post here.
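For the first exception (the wait_timeout one), a common approach is to validate pooled connections before handing them out. A hedged sketch of a JNDI DataSource in Tomcat's context.xml, assuming the default DBCP pool of Tomcat 7 (the resource name, URL and credentials are placeholders):
<Resource name="jdbc/appDB" auth="Container" type="javax.sql.DataSource"
          driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost:3306/appdb"
          username="app" password="secret"
          maxActive="20" maxIdle="5" maxWait="10000"
          validationQuery="SELECT 1"
          testOnBorrow="true"
          testWhileIdle="true"
          timeBetweenEvictionRunsMillis="30000"
          minEvictableIdleTimeMillis="60000" />
With testOnBorrow and a validationQuery, connections the MySQL server has already closed after wait_timeout are discarded and replaced instead of being handed to the application.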
