Inconsistent Tomcat response time - multithreading

My Tomcat (8.5.8, behind Apache HTTPD) occasionally (< 1 percent of requests) shows a very high waited time (as reported by Glowroot).
JVM Thread Stats
CPU time: 422.3 milliseconds
Blocked time: 0.0 milliseconds
Waited time: 26,576.0 milliseconds
Allocated memory: 73.3 MB
The stack trace below shows Tomcat waiting on a latch. At that time the Tomcat host's memory, CPU, disk I/O, and garbage collection all looked fine.
I tried issuing the same request, and the same data (compressed size 2.5 MB, about 25 MB uncompressed) was returned. Most of the time it was fine (< 2 seconds), but a few times it took much longer (> 20 seconds).
com.fasterxml.jackson.core.json.UTF8JsonGenerator.writeString(UTF8JsonGenerator.java:432)
com.fasterxml.jackson.core.json.UTF8JsonGenerator._writeStringSegments(UTF8JsonGenerator.java:1148)
com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:2003)
org.springframework.security.web.context.OnCommittedResponseWrapper$SaveContextServletOutputStream.write(OnCommittedResponseWrapper.java:540)
org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:96)
org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:369)
org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:391)
org.apache.catalina.connector.OutputBuffer.append(OutputBuffer.java:713)
org.apache.catalina.connector.OutputBuffer.flushByteBuffer(OutputBuffer.java:808)
org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:351)
org.apache.coyote.Response.doWrite(Response.java:517)
org.apache.coyote.ajp.AjpProcessor$SocketOutputBuffer.doWrite(AjpProcessor.java:1469)
org.apache.coyote.ajp.AjpProcessor.access$900(AjpProcessor.java:54)
org.apache.coyote.ajp.AjpProcessor.writeData(AjpProcessor.java:1353)
org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:361)
org.apache.tomcat.util.net.SocketWrapperBase.writeBlocking(SocketWrapperBase.java:419)
org.apache.tomcat.util.net.SocketWrapperBase.doWrite(SocketWrapperBase.java:670)
org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite(NioEndpoint.java:1259)
org.apache.tomcat.util.net.NioSelectorPool.write(NioSelectorPool.java:157)
org.apache.tomcat.util.net.NioBlockingSelector.write(NioBlockingSelector.java:114)
org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.awaitWriteLatch(NioEndpoint.java:1109)
org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.awaitLatch(NioEndpoint.java:1106)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
sun.misc.Unsafe.park(Native Method)
TIMED_WAITING
What is the waiting due to? A connectivity issue with the client? A slow client? A connectivity issue between Apache HTTPD and Tomcat? Not enough file descriptors?
And what exactly is the latch here?

Related

Write high bandwidth real-time data to SSD in Linux

I have a real-time process that receives 16 kB of data every 200 us for about 1 hr. I need to store this data.
I have a 240 GB SSD on a SATA III channel, and I thought I could use it as a plain storage device without any filesystem on it. I am running the 5.4.0-109-generic kernel with 8 GB of RAM.
Here is what I have done so far:
I set up a 1 GB shared memory region (shm) where I write the data, and I use a semaphore to tell a logger process when data is available.
In the logger process:
I open the SSD:
fd = open("/dev/sdb",O_WRONLY|O_LARGEFILE);
I wait for the data to be available in the shm and then I write a chunk, writing_size, of it to the SSD:
written_size = write(fd,local_buffer,writing_size);
I checked, and written_size is always equal to writing_size.
I perform an fsync() after N write cycles:
if (written_cycles > N)
    ret = fsync(fd);
I checked, and fsync never returns -1.
I set the I/O scheduler of /dev/sdb to noop, and I experimented with different values of writing_size and N. The final values I came up with are writing_size = 64 kB and N = 16.
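For reference, here is a minimal sketch of the logger's write path as I understand it from the description above. The shared-memory and semaphore plumbing is omitted, and names such as local_buffer, writing_size, and N are taken from the question; the exact loop structure is my assumption, not the original code.

    /* Minimal sketch of the assumed logger write path: write 64 kB chunks
     * to the raw device and fsync() every N writes. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define WRITING_SIZE (64 * 1024)  /* 64 kB per write, from the question */
    #define N 16                      /* fsync every 16 writes, from the question */

    int main(void)
    {
        static char local_buffer[WRITING_SIZE];
        int written_cycles = 0;

        int fd = open("/dev/sdb", O_WRONLY | O_LARGEFILE);
        if (fd < 0) { perror("open"); return 1; }

        for (;;) {
            /* ... wait on the semaphore and copy a chunk out of the shm ... */

            ssize_t written_size = write(fd, local_buffer, WRITING_SIZE);
            if (written_size != WRITING_SIZE) { perror("write"); break; }

            if (++written_cycles > N) {
                if (fsync(fd) == -1) perror("fsync");
                written_cycles = 0;
            }
        }

        close(fd);
        return 0;
    }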
The behavior that I am seeing is this:
The whole process works very well until about 17 GB have been written. At that point the logger process starts being put into "uninterruptible sleep" (D state) quite often, for 1 or 2 seconds at a time. This is still fine, since the shared memory buffer takes ~13 s to fill up. Once the amount of data written reaches about 20 GB, the logger is put to sleep for much longer, until the sleeps reach 13 s and I start to lose data. The threshold at which the process starts being put to sleep is quite repeatable (16-17 GB), but the maximum amount of data I can save before losing some is random.
This is the best I can achieve so far with my method and the writing_size and N tuning mentioned previously.
I tried setting the logger process's nice value to -20, with no improvement.
It also looks like the noop I/O scheduler does not support ionice, so I tried the CFQ scheduler with the highest ionice priority, but performance was even worse.
I suspect the logger process is being put to sleep waiting for I/O, but I do not understand why it only happens after a certain number of bytes have been written. iotop shows that the I/O bandwidth of the logger process is stable at around 85 MB/s.
I welcome any suggestions.
PS: I did try to mmap the SSD and use memcpy instead of write() + fsync(), but mmap is slower and the results are worse.

How "real-time" are the FIFO/RR schedulers on non-RT Linux kernel?

Say on a non-RT Linux kernel (4.14, Angstrom distro, running on iMX6) I have a program that receives UDP packets (< 1400 bytes) that come in at a very steady data rate. Basically, the essence of the program is:
while (true) {
    recv(sockFd, ...);
    update_loop_interval_histogram();  // O(1)
}
To minimize the worst-case delay between loop iterations (the loop interval), I started my process with:
chrt --fifo 99 ./programName
which sets the scheduler to the "real-time" SCHED_FIFO policy with the highest priority.
The CPU affinity of my process is pinned to the second core.
On top of that, I ran one benchmark program instance per core, deliberately driving the CPU load to 100%.
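For completeness, the same setup (SCHED_FIFO at priority 99, pinned to the second core) can also be done from inside the program. This is only a sketch of the standard Linux calls behind chrt and taskset, not code from the original question:

    /* Sketch: pin the calling process to CPU 1 and switch it to SCHED_FIFO 99,
     * roughly what `taskset -c 1 chrt --fifo 99 ./programName` does externally.
     * Requires root or CAP_SYS_NICE. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int make_realtime(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);  /* second core; CPUs are numbered from 0 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }

        struct sched_param sp = { .sched_priority = 99 };
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return -1;
        }
        return 0;
    }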
That way, I get a maximum loop interval of ~10 ms (vs. ~25 ms without SCHED_FIFO). Such long intervals are rare: over, say, an hour of runtime, the count of intervals below 400 µs divided by the count of all other intervals (400 µs to 10,000 µs) is over 1.5 million.
But as rare as they are, they are still bad.
Is that the best one can reliably get on a non-real-time Linux kernel, or are there further tweaks to be made to get to something like a 5 ms maximum interval time?

How could uptime command show a cpu time greater than 100%?

Running an application with an infinite loop (no sleep, no system calls inside the loop) on a Linux system with kernel 2.6.11 and a single-core processor consumes 98-99% CPU time. That's normal.
I have another single-threaded server application that normally sleeps about 98% of the time and uses at most 20% CPU time. Once a network client connects, the sleep average drops to 88% (nothing strange), but the CPU time (1-minute average) climbs steadily, though not immediately, past 100%... I have even seen 160%!? The network traffic is quite slow (one small packet every 0.5 seconds). The 15-minute average from the uptime command usually shows about 115%.
I also ran gprof, but I did not find it useful... the profiles for the two scenarios look similar:
IDLE SCENARIO
%time name
59.09 Run()
25.00 updateInfos()
5.68 updateMcuInfo()
2.27 getLastEvent()
....
CONNECTED SCENARIO
%time name
38.42 updateInfo()
34.49 Run()
10.57 updateCUinfo()
4.36 updateMcuInfo()
3.90 ...
1.77 ...
....
None of the listed functions is directly involved in client communication.
How can this behaviour be explained? Is it a known bug? Could the extra time be spent in kernel space after a system call, leading to a calculation like
% = 100 * (apptime + kerneltime) / (total app time)?
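One way to check whether kernel time is responsible is to look at the process's user and system CPU time separately. A small sketch using getrusage() (my suggestion, not something from the original question):

    /* Sketch: report user vs. system CPU time of the calling process.
     * If system time grows much faster once the client connects, the
     * "kernel time after system calls" hypothesis becomes plausible. */
    #include <stdio.h>
    #include <sys/resource.h>

    void report_cpu_split(void)
    {
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            printf("user:   %ld.%06ld s\n",
                   (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
            printf("system: %ld.%06ld s\n",
                   (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        }
    }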

zero read requests but high average IO queue size

I have a 4 GB file that is just rows of sorted key,value pairs, which I have "primed" into memory (I have 32 GB of RAM) with
cat file_name > /dev/null
Then, from a program, I access the rows by key. When I spawn 1000 threads to read from the file simultaneously and monitor the I/O with iostat -xkd, I see 0 r/s (read requests) but an avgqu-sz (average queue size) of 45. The 0 r/s makes sense to me, since the pages of the file are most likely in the page cache thanks to the command above. Are I/O requests still queued even when the pages are in the page cache? If only actual disk reads are queued, how do I explain this phenomenon?
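As a side note, one way to confirm that the file really is resident in the page cache is mincore(). A small sketch (the path is a placeholder):

    /* Sketch: count how many pages of a file are resident in the page cache. */
    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "file_name";  /* placeholder */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        long page = sysconf(_SC_PAGESIZE);
        size_t pages = (st.st_size + page - 1) / page;
        unsigned char *vec = malloc(pages);

        if (mincore(map, st.st_size, vec) == 0) {
            size_t resident = 0;
            for (size_t i = 0; i < pages; i++)
                resident += vec[i] & 1;
            printf("%zu of %zu pages resident\n", resident, pages);
        }

        free(vec);
        munmap(map, st.st_size);
        close(fd);
        return 0;
    }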

How to find the processor queue length in linux

I am trying to determine the processor queue length (the number of processes that are ready to run but currently aren't running) on a Linux machine. There is a WMI call in Windows for this metric, but not knowing much about Linux I'm trying to mine /proc and top for the information. Is there a way to determine the queue length for the CPU?
Edit to add: Microsoft's words concerning their metric: "The collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue."
sar -q will report queue length, task list length and three load averages.
Example:
matli@tornado:~$ sar -q 1 0
Linux 2.6.27-9-generic (tornado) 01/13/2009 _i686_
11:38:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:38:33 PM 0 305 1.26 0.95 0.54
11:38:34 PM 4 305 1.26 0.95 0.54
11:38:35 PM 1 306 1.26 0.95 0.54
11:38:36 PM 1 306 1.26 0.95 0.54
^C
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 256368 53764 75980 220564 2 28 60 54 774 1343 15 4 78 2
The first column (r) is the run queue - 2 on my machine right now
Edit: Surprised there isn't a way to just get the number
Quick 'n' dirty way to get the number (might vary a little on different machines):
vmstat|tail -1|cut -d" " -f2
The metrics you seek exist in /proc/schedstat.
The format of this file is described in sched-stats.txt in the kernel source. Specifically, the cpu<N> lines are what you want:
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
In particular, field 8. To find the run queue length, you would:
Observe field 8 for each CPU and record the value.
Wait for some interval.
Observe field 8 for each CPU again, and calculate how much the value has increased.
Divide that difference by the length of the interval you waited (the documentation says the field is in jiffies, but since the addition of CFS it is actually in nanoseconds). By Little's Law, this yields the mean length of the scheduler run queue over the interval (see the sketch below).
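A rough sketch of that calculation in C, using the field positions documented above (illustrative only, not a polished tool):

    /* Sketch: estimate the mean run-queue length from /proc/schedstat via
     * Little's Law. Field 8 of each cpu<N> line is the cumulative time tasks
     * spent waiting to run, in nanoseconds on CFS kernels. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Sum field 8 across all cpu<N> lines. */
    static unsigned long long total_wait_time(void)
    {
        unsigned long long total = 0, f[9];
        char line[512];
        FILE *fp = fopen("/proc/schedstat", "r");
        if (!fp) return 0;
        while (fgets(line, sizeof line, fp)) {
            if (strncmp(line, "cpu", 3) == 0 &&
                sscanf(line, "cpu%*d %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                       &f[0], &f[1], &f[2], &f[3], &f[4],
                       &f[5], &f[6], &f[7], &f[8]) == 9)
                total += f[7];  /* 8th documented field: time spent waiting to run */
        }
        fclose(fp);
        return total;
    }

    int main(void)
    {
        const unsigned interval_s = 5;

        unsigned long long before = total_wait_time();
        sleep(interval_s);
        unsigned long long after = total_wait_time();

        /* Little's Law: mean queue length = accumulated wait time / elapsed time. */
        double mean_runq = (double)(after - before) / (interval_s * 1e9);
        printf("mean run-queue length over %u s: %.2f\n", interval_s, mean_runq);
        return 0;
    }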
Unfortunately, I'm not aware of any utility that automates this process and is commonly installed, or even packaged, in a Linux distribution. I've not used it, but the kernel documentation suggests http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c, which unfortunately points to a domain that no longer resolves. Fortunately, it's available on the Wayback Machine.
Why not sar or vmstat?
These tools report the number of currently runnable processes. Certainly if this number is greater than the number of CPUs, some of them must be waiting. However, processes can still be waiting even when the number of processes is less than the number of CPUs, for a variety of reasons:
A process may be pinned to a particular CPU.
The scheduler may decide to schedule a process on a particular CPU to make better utilization of cache, or for NUMA optimization reasons.
The scheduler may intentionally idle a CPU to allow more time to a competing, higher priority process on another CPU that shares the same execution core (a hyperthreading optimization).
Hardware interrupts may be processable only on particular CPUs for a variety of hardware and software reasons.
Moreover, the number of runnable processes is only sampled at an instant in time. In many cases this number may fluctuate rapidly, and the contention may be occurring between the times the metric is being sampled.
These things mean the number of runnable processes minus the number of CPUs is not a reliable indicator of CPU contention.
uptime will give you the recent load average, which is approximately the average number of active processes. uptime reports the load average over the last 1, 5, and 15 minutes. It's a per-system measurement, not per-CPU.
Not sure what the processor queue length in Windows is, hopefully it's close enough to this?
